Abstract

Exploring  and  analyzing  large  volumes  of  data  plays  an  increasingly  important  role  in  many  domains 
of  scientific  research.  We  have  been  developing  the  Active  Data  Repository  (ADR),  an  infrastructure  that 
integrates  storage,  retrieval,  and  processing  of  large  multi-dimensional  scientific  datasets  on  distributed 
memory  parallel  machines  with  multiple  disks  attached  to  each  node.  In  earlier  work,  we  proposed  three 
strategies  for  processing  range  queries  within  the  ADR  framework.  Our  experimental  results  show  that 
the  relative  performance  of  the  strategies  changes  under  varying  application  characteristics  and  machine 
configurations.  In  this  work  we  describe  analytical  models  to  predict  the  average  computation,  I/O  and 
communication  operation  counts  of  the  strategies  when  input  data  elements  are  uniformly  distributed  in 
the  attribute  space  of  the  output  dataset,  restricting  the  output  dataset  to  be  a  regular  d-dimensional  array. 
We  validate  these  models  for  various  synthetic  datasets  and  for  several  driving  applications. 


1  Introduction 

The  exploration  and  analysis  of  large  datasets  is  playing  an  increasingly  central  role  in  many  areas  of 
scientific  research.  Over  the  past  several  years  we  have  been  actively  working  on  data  intensive  applications 
that  employ  large-scale  scientific  datasets,  including  applications  that  explore,  compare,  and  visualize 
results generated by large-scale simulations [8], visualize and generate data products from global coverage
satellite  data  [4],  and  visualize  and  analyze  digitized  microscopy  images  [1].  Such  applications  often  use 
only  a  subset  of  all  the  data  available  in  both  the  input  and  output  datasets.  References  to  data  items  are 
described  by  a  range  query,  namely  a  multi-dimensional  bounding  box  in  the  underlying  multi-dimensional 
attribute space of the dataset(s). Only the data items whose associated coordinates fall within the multi-dimensional
box are retrieved and processed. The processing structures of these applications also share
common  characteristics.  Figure  1  shows  high-level  pseudo-code  for  the  basic  processing  loop  in  these 
applications.  The  processing  steps  consist  of  retrieving  input  and  output  data  items  that  intersect  the  range 
query  (steps  1-2  and  4-5),  mapping  the  coordinates  of  the  retrieved  input  items  to  the  corresponding  output 
items  (step  6),  and  aggregating,  in  some  way,  all  the  retrieved  input  items  mapped  to  the  same  output  data 
items (steps 7-8). Correctness of the output data values usually does not depend on the order in which input data items
are aggregated. The mapping function, Map(i_e), maps an input item to a set of output items. We extend
the computational model to allow for an intermediate data structure, referred to as an accumulator, that can
be used to hold intermediate results during processing. For example, an accumulator can be used to keep a
running sum for an averaging operation. The aggregation function, Aggregate(i_e, a_e), aggregates the value
of an input item with the intermediate result stored in the accumulator element (a_e). The output dataset
from a query is usually much smaller than the input dataset, hence steps 4-8 are called the reduction phase
of the processing. Accumulator elements are allocated and initialized (step 3) before the reduction phase.
The intermediate results stored in the accumulator are post-processed to produce final results (steps 9-11).

      O <- Output Dataset, I <- Input Dataset
      (* Initialization *)
  1.  foreach o_e in O do
  2.      read o_e
  3.      a_e <- Initialize(o_e)
      (* Reduction *)
  4.  foreach i_e in I do
  5.      read i_e
  6.      S_a <- Map(i_e)
  7.      foreach a_e in S_a do
  8.          a_e <- Aggregate(i_e, a_e)
      (* Output *)
  9.  foreach a_e do
 10.      o_e <- Output(a_e)
 11.      write o_e

Figure 1: The basic processing loop in the target applications.
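
The processing loop of Figure 1 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: datasets are in-memory dictionaries keyed by item coordinate, and the name `process_query`, the averaging functions, and the sample data are hypothetical rather than part of ADR.

```python
def process_query(inputs, outputs, initialize, map_fn, aggregate, output_fn):
    """Run the Figure 1 loop over in-memory dicts keyed by item coordinate."""
    # Steps 1-3: allocate and initialize one accumulator element per output item.
    acc = {key: initialize(value) for key, value in outputs.items()}
    # Steps 4-8: reduction. Each input item is aggregated into every
    # accumulator element that it maps to.
    for in_key, in_value in inputs.items():
        for out_key in map_fn(in_key):
            if out_key in acc:
                acc[out_key] = aggregate(in_value, acc[out_key])
    # Steps 9-11: post-process the accumulators into final output values.
    return {key: output_fn(a) for key, a in acc.items()}


# Hypothetical averaging example: input coordinate x maps to output cell x // 2.
# The accumulator keeps a (running sum, count) pair.
inputs = {0: 1.0, 1: 3.0, 2: 10.0, 3: 20.0}
outputs = {0: None, 1: None}
result = process_query(
    inputs,
    outputs,
    initialize=lambda o: (0.0, 0),
    map_fn=lambda x: [x // 2],
    aggregate=lambda v, a: (a[0] + v, a[1] + 1),
    output_fn=lambda a: a[0] / a[1] if a[1] else None,
)
print(result)  # {0: 2.0, 1: 15.0}
```

Because the aggregation here is a sum and a count, the result does not depend on the order in which input items are visited, which is the property the text relies on.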

We  have  been  developing  the  Active  Data  Repository  (ADR)  [2],  a  software  system  that  efficiently 
supports  the  processing  loop  shown  in  Figure  1,  integrating  storage,  retrieval,  and  processing  of  large 
multi-dimensional  scientific  datasets  on  distributed  memory  parallel  machines  with  multiple  disks  attached 
to each node. ADR is designed as a set of modular services implemented in C++. Through use of
these  services,  ADR  allows  customization  for  application  specific  processing  (i.e.  the  Initialize,  Map, 
Aggregate,  and  Output  functions  described  above),  while  providing  support  for  common  operations  such 
as  memory  management,  data  retrieval,  and  scheduling  of  processing  across  a  parallel  machine.  The  system 
architecture  of  ADR  consists  of  a  front-end  and  a  parallel  back-end.  The  front-end  interacts  with  clients, 
and  forwards  range  queries  with  references  to  user-defined  processing  functions  to  the  parallel  back-end. 
During  query  execution,  back-end  nodes  retrieve  input  data  and  perform  user-defined  operations  over  the 
data  items  retrieved  to  generate  the  output  products.  Output  products  can  be  returned  from  the  back-end 
nodes  to  the  requesting  client,  or  stored  in  ADR. 

In earlier work [3, 7], we described three potential processing strategies, and evaluated the relative performance
of these strategies for several application scenarios and machine configurations. Our experimental
results showed that the relative performance of the strategies changes under varying application characteristics
and machine configurations. In this paper we describe analytical models to predict the average
operation  counts  of  the  strategies  when  input  data  elements  are  uniformly  distributed  in  the  attribute  space 
of  the  output  dataset,  restricting  the  output  dataset  to  be  a  regular  d-dimensional  array.  We  validate  these 
cost  models  with  queries  for  synthetic  datasets  and  for  several  driving  applications  [1,  4,  8]. 

2  Overview  of  ADR


In this section we briefly describe three strategies for processing range queries in ADR. We first
describe how datasets are stored in ADR, and outline the main phases of query execution in ADR. More
detailed descriptions of these strategies and of ADR in general can be found in [2, 3, 7].

2.1  Storing  Datasets  in  ADR 

A  dataset  is  partitioned  into  a  set  of  chunks  to  achieve  high  bandwidth  data  retrieval.  A  chunk  consists 
of  one  or  more  data  items,  and  is  the  unit  of  I/O  and  communication  in  ADR.  That  is,  a  chunk  is  always 
retrieved,  communicated  and  computed  on  as  a  whole  during  query  processing.  Every  data  item  is  associated 
with  a  point  in  a  multi-dimensional  attribute  space,  so  every  chunk  is  associated  with  a  minimum  bounding 
rectangle  (MBR)  that  encompasses  the  coordinates  (in  the  associated  attribute  space)  of  all  the  data  items 
in the chunk. In the remainder of this paper, we use the MBR of a chunk to determine the extent, the
volume,  the  mid-point,  and  the  top-right  corner  of  the  chunk.  Since  data  is  accessed  through  range  queries, 
it  is  desirable  to  have  data  items  that  are  close  to  each  other  in  the  multi-dimensional  space  placed  in  the 
same  chunk.  Chunks  are  distributed  across  the  disks  attached  to  ADR  back-end  nodes  using  a  declustering 
algorithm  [5,  9]  to  achieve  I/O  parallelism  during  query  processing.  Each  chunk  is  assigned  to  a  single  disk, 
and  is  read  and/or  written  during  query  processing  only  by  the  local  processor  to  which  the  disk  is  attached. 
If  a  chunk  is  required  for  processing  by  one  or  more  remote  processors,  it  is  sent  to  those  processors  by  the 
local  processor  via  interprocessor  communication.  After  all  data  chunks  are  stored  into  the  desired  locations 
in  the  disk  farm,  an  index  (e.g.,  an  R-tree  [6])  is  constructed  using  the  MBRs  of  the  chunks.  The  index  is 
used  by  the  back-end  nodes  to  find  the  local  chunks  with  MBRs  that  intersect  the  range  query. 
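
As a rough illustration of how chunk MBRs are used to answer a range query, the sketch below computes the MBR of each chunk and selects the chunks whose MBRs intersect a query box. It is not ADR code: the function names and sample chunks are hypothetical, and a linear scan stands in for the index (e.g., an R-tree) described above.

```python
def mbr(points):
    """Minimum bounding rectangle (low corner, high corner) of d-dimensional points."""
    lo = tuple(min(c) for c in zip(*points))
    hi = tuple(max(c) for c in zip(*points))
    return lo, hi


def intersects(box_a, box_b):
    """True if two axis-aligned boxes overlap (touching boundaries count)."""
    (alo, ahi), (blo, bhi) = box_a, box_b
    return all(al <= bh and bl <= ah for al, ah, bl, bh in zip(alo, ahi, blo, bhi))


def select_chunks(chunk_mbrs, query):
    """Linear scan standing in for an index lookup over chunk MBRs."""
    return [cid for cid, box in chunk_mbrs.items() if intersects(box, query)]


# Hypothetical chunks, each a list of 2-dimensional data item coordinates.
chunks = {
    "c0": [(0, 0), (1, 2)],
    "c1": [(5, 5), (6, 7)],
    "c2": [(2, 1), (3, 3)],
}
chunk_mbrs = {cid: mbr(pts) for cid, pts in chunks.items()}
hits = select_chunks(chunk_mbrs, ((1, 1), (4, 4)))
print(hits)  # ['c0', 'c2']
```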

2.2  Query  Processing  in  ADR 

Processing of a query in ADR is accomplished in two steps: query planning and query execution.

A plan specifies how parts of the final output are computed and the order in which the input data chunks are
retrieved for processing. Planning is carried out in two steps: tiling and workload partitioning. In the tiling
step,  if  the  output  dataset  is  too  large  to  fit  entirely  into  the  memory,  it  is  partitioned  into  output  tiles.  Each 
output  tile  contains  a  distinct  subset  of  the  output  chunks,  so  that  the  total  size  of  the  chunks  in  an  output 
tile  is  less  than  the  amount  of  memory  available  for  output  data.  Tiling  of  the  output  implicitly  results  in 
a  tiling  of  the  input  dataset.  Each  input  tile  contains  the  input  chunks  that  map  to  the  output  chunks  in  the 
output tile. Similar to data chunks, an output tile is associated with an MBR that encompasses the MBRs (in
the  associated  attribute  space)  of  all  the  output  chunks  in  the  tile.  During  query  processing,  each  output  tile 
is  cached  in  main  memory,  and  input  chunks  from  the  required  input  tile  are  retrieved.  Since  a  mapping 
function  may  map  an  input  element  to  multiple  output  elements,  an  input  chunk  may  appear  in  more  than 
one  input  tile  if  the  corresponding  output  chunks  are  assigned  to  different  tiles.  Hence,  an  input  chunk  may 
be  retrieved  multiple  times  during  execution  of  the  processing  loop.  In  the  workload  partitioning  step,  the 
workload  associated  with  each  tile  (i.e.  aggregation  of  input  items  into  accumulator  chunks)  is  partitioned 
across  processors.  This  is  accomplished  by  assigning  each  processor  the  responsibility  for  processing  a 
subset  of  the  input  and/or  accumulator  chunks. 

The  execution  of  a  query  on  a  back-end  processor  progresses  through  four  phases  for  each  tile: 

1. Initialization. Accumulator chunks in the current tile are allocated space in memory and initialized.
If  an  existing  output  dataset  is  required  to  initialize  accumulator  elements,  an  output  chunk  is  retrieved 
by  the  processor  that  has  the  chunk  on  its  local  disk,  and  the  chunk  is  forwarded  to  the  processors  that 
require  it. 

2. Local Reduction. Input data chunks on the local disks of each back-end node are retrieved and
aggregated  into  the  accumulator  chunks  allocated  in  each  processor’s  memory  in  phase  1. 

3.  Global  Combine.  If  necessary,  results  computed  in  each  processor  in  phase  2  are  combined  across 
all  processors  to  compute  final  results  for  the  accumulator  chunks. 

4.  Output  Handling.  The  final  output  chunks  for  the  current  tile  are  computed  from  the  corresponding 
accumulator  chunks  computed  in  phase  3. 

A  query  iterates  through  these  phases  repeatedly  until  all  tiles  have  been  processed  and  the  entire  output 
dataset  has  been  computed.  To  reduce  query  execution  time,  ADR  overlaps  disk  operations,  network 
operations  and  processing  as  much  as  possible  during  query  processing.  Overlap  is  achieved  by  maintaining 
explicit  queues  for  each  kind  of  operation  (data  retrieval,  message  sends  and  receives,  data  processing)  and 
switching  between  queued  operations  as  required.  Pending  asynchronous  I/O  and  communication  operations 
in  the  queues  are  polled  and,  upon  their  completion,  new  asynchronous  operations  are  initiated  when  there 
is  more  work  to  be  done  and  memory  buffer  space  is  available.  Data  chunks  are  therefore  retrieved  and 
processed  in  a  pipelined  fashion. 

2.3  Query  Processing  Strategies 

In  the  following  discussion,  we  refer  to  an  input/output  data  chunk  stored  on  one  of  the  disks  attached  to  a 
processor  as  a  local  chunk  on  that  processor.  Otherwise,  it  is  a  remote  chunk.  A  processor  owns  an  input 
or  output  chunk  if  it  is  a  local  input  or  output  chunk.  A  ghost  chunk  is  a  copy  of  an  accumulator  chunk 
allocated  in  the  memory  of  a  processor  that  does  not  own  the  corresponding  output  chunk. 

In  the  tiling  phase  of  all  the  strategies  described  in  this  section,  we  use  a  Hilbert  space-filling  curve  [5] 
to  create  the  tiles.  The  goal  is  to  minimize  the  total  length  of  the  boundaries  of  the  tiles,  by  assigning 
chunks  that  are  spatially  close  in  the  multi-dimensional  attribute  space  to  the  same  tile,  to  reduce  the  number 
of  input  chunks  crossing  tile  boundaries.  The  advantage  of  using  Hilbert  curves  is  that  they  have  good 
clustering  properties  [9],  since  they  preserve  locality.  In  our  implementation,  the  mid-point  of  the  bounding 
box  of  each  output  chunk  is  used  to  generate  a  Hilbert  curve  index.  The  chunks  are  sorted  with  respect  to 
this  index,  and  selected  in  this  order  for  tiling. 
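
The tiling step can be sketched as follows for a 2-dimensional output grid. This is an illustrative simplification, not the ADR implementation: it assumes chunk mid-points are given as integer cells of an n x n grid with n a power of two, that all chunks have the same size, and that a tile may hold as many chunks as fit in memory. The function names are hypothetical; the index computation is the standard bitwise Hilbert curve construction.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve on an n x n grid (n a power of 2)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def make_tiles(chunk_midpoints, n, chunk_bytes, memory_bytes):
    """Sort chunks by Hilbert index and cut the sorted order into memory-sized tiles."""
    per_tile = memory_bytes // chunk_bytes  # assumed to be at least 1
    order = sorted(chunk_midpoints, key=lambda c: hilbert_index(n, *chunk_midpoints[c]))
    return [order[k:k + per_tile] for k in range(0, len(order), per_tile)]


# Hypothetical 4 x 4 grid of output chunks; each chunk id is its own mid-point cell.
midpoints = {(x, y): (x, y) for x in range(4) for y in range(4)}
tiles = make_tiles(midpoints, n=4, chunk_bytes=1, memory_bytes=4)
for tile in tiles:
    print(tile)
```

With room for four chunks per tile, each tile in this example is a compact 2 x 2 block of chunks, which is the locality property that motivates the Hilbert ordering.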

Fully Replicated Accumulator (FRA) Strategy. In this scheme each processor performs processing associated
with its local input chunks. The output chunks are partitioned into tiles, each of which fits into the
available local memory of a single back-end processor. When an output chunk is assigned to a tile, the corresponding
accumulator chunk is put into the set of local accumulator chunks in the processor that owns the
output chunk, and is assigned as a ghost chunk on all other processors. This scheme effectively replicates all
of  the  accumulator  chunks  in  a  tile  on  each  processor,  and  during  the  local  reduction  phase,  each  processor 
generates  partial  results  for  the  accumulator  chunks  using  only  its  local  input  chunks.  Ghost  chunks  with 
partial  results  are  then  forwarded  to  the  processors  that  own  the  corresponding  output  (accumulator)  chunks 
during  the  global  combine  phase  to  produce  the  complete  intermediate  result,  and  eventually  the  final  output 
product. 

Sparsely Replicated Accumulator (SRA) Strategy. The FRA strategy replicates each accumulator chunk
in every processor, even if no input chunks will be aggregated into the accumulator chunks in some processors.
This results in unnecessary initialization overhead in the initialization phase of query execution, and
extra communication and computation in the global combine phase. The available memory in the system
also is not efficiently employed, because of unnecessary replication. Such replication may result in more
tiles being created than necessary, which may cause a large number of input chunks to be retrieved from
disk more than once. In the SRA strategy, a ghost chunk is allocated only on processors owning at least one
input chunk that maps to the corresponding accumulator chunk.

Figure 2: FRA strategy (left) and DA strategy (right).
[Panel labels: "Communication for Input Elements" (black regions represent the clipped out regions of triangles); "Communication for Replicated Output Blocks".]
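
The difference between FRA and SRA in ghost chunk allocation can be illustrated with a small sketch. It considers a single tile only, and the function name, the dictionary-based data layout, and the sample chunk placement are all hypothetical.

```python
def ghost_chunks(owner_of_output, owner_of_input, maps_to, processors, strategy):
    """Return {processor: set of output chunks held as ghost chunks} for one tile."""
    ghosts = {p: set() for p in processors}
    for out, owner in owner_of_output.items():
        for p in processors:
            if p == owner:
                continue  # the owner holds the local accumulator chunk, not a ghost
            if strategy == "FRA":
                # FRA: replicate on every non-owning processor.
                ghosts[p].add(out)
            elif strategy == "SRA":
                # SRA: replicate only where some local input chunk maps to `out`.
                if any(owner_of_input[i] == p and out in outs
                       for i, outs in maps_to.items()):
                    ghosts[p].add(out)
            else:
                raise ValueError("strategy must be 'FRA' or 'SRA'")
    return ghosts


# Hypothetical placement: 3 processors, 2 output chunks, 3 input chunks.
processors = [0, 1, 2]
owner_of_output = {"o0": 0, "o1": 1}
owner_of_input = {"i0": 0, "i1": 1, "i2": 2}
maps_to = {"i0": {"o0", "o1"}, "i1": {"o1"}, "i2": {"o0"}}

fra = ghost_chunks(owner_of_output, owner_of_input, maps_to, processors, "FRA")
sra = ghost_chunks(owner_of_output, owner_of_input, maps_to, processors, "SRA")
print(sum(len(g) for g in fra.values()))  # 4
print(sum(len(g) for g in sra.values()))  # 2
```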

Distributed  Accumulator  (DA)  Strategy.  In  this  scheme,  every  processor  is  responsible  for  all  processing 
associated with its local output chunks. Tiling is done by selecting, for each processor, local output chunks
from that processor until the memory space allocated for the corresponding accumulator chunks in the
processor is filled. As in the other schemes, output chunks are selected in Hilbert curve order.

Since no accumulator chunks are replicated by the DA strategy, no ghost chunks are allocated. This
allows DA to make more effective use of memory and produce fewer tiles than the other two schemes.
As a result, fewer input chunks are likely to be retrieved for multiple tiles. Furthermore, DA avoids
interprocessor communication for accumulator chunks during the initialization phase and for ghost chunks
during the global combine phase, and also requires no computation in the global combine phase. On the
other hand, it introduces communication in the local reduction phase for input chunks; all the remote input
chunks that map to the same output chunk must be forwarded to the processor that owns the output chunk.
Since a projection function may map an input chunk to multiple output chunks, an input chunk may be
forwarded to multiple processors.
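
The input chunk forwarding introduced by DA can be sketched as follows. The sketch assumes one message per (input chunk, remote destination processor) pair within a single tile; the function name and the sample placement are hypothetical.

```python
def da_messages(owner_of_output, owner_of_input, maps_to):
    """Return {input chunk: sorted remote processors it must be forwarded to}."""
    sends = {}
    for i, outs in maps_to.items():
        # Destinations are the owners of all output chunks that `i` maps to,
        # excluding the processor that already holds `i` on its local disk.
        dests = {owner_of_output[o] for o in outs} - {owner_of_input[i]}
        sends[i] = sorted(dests)
    return sends


# Hypothetical placement: 3 processors, 2 output chunks, 3 input chunks.
owner_of_output = {"o0": 0, "o1": 1}
owner_of_input = {"i0": 0, "i1": 1, "i2": 2}
maps_to = {"i0": {"o0", "o1"}, "i1": {"o1"}, "i2": {"o0"}}

sends = da_messages(owner_of_output, owner_of_input, maps_to)
print(sends)  # {'i0': [1], 'i1': [], 'i2': [0]}
print(sum(len(d) for d in sends.values()))  # 2
```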

Figure  2  illustrates  the  FRA  and  DA  strategies  for  an  example  application.  One  possible  distribution 
of  input  and  output  chunks  to  the  processors  is  illustrated  at  the  top.  Input  chunks  are  denoted  by  triangles 
while output chunks are denoted by rectangles. The final result to be computed by reduction (aggregation)
operations  is  also  shown. 

3  Assumptions  and  Definitions


In this section, we describe the assumptions behind our cost models and define the parameters
used in the models. We also define how the MBR of an output tile is partitioned into regions for analysis
purposes. As will be seen later, the regions are used to estimate the expected number of output tiles an
input chunk maps to and the expected number of messages that DA generates for an input chunk during the
local reduction phase.

3.1  Assumptions 

The  cost  models  presented  in  this  paper  make  the  following  assumptions. 

•  A  shared-nothing  architecture  with  local  disks  is  employed. 

•  All  processors  are  assumed  to  have  the  same  amount  of  memory. 

•  All  input  chunks  are  of  the  same  number  of  bytes,  and  have  the  same  extent  when  mapped  into  the 
output  attribute  space. 

•  All  output  chunks  are  of  the  same  number  of  bytes,  and  their  extents  form  a  regular  multi-dimensional 
grid. 

•  All  input  chunks  map  to  the  same  number  of  output  chunks,  and  all  output  chunks  are  mapped  to  by 
the  same  number  of  input  chunks. 

•  An  accumulator  chunk  is  assumed  to  have  the  same  number  of  bytes  as  that  of  its  corresponding 
output  chunk. 

•  Input  chunks  and  output  chunks  are  assumed  to  be  distributed  among  processors  by  a  declustering 
algorithm  that  achieves  perfect  declustering;  that  is,  all  the  input  (output)  chunks  whose  MBRs 
intersect  a  given  range  query  are  distributed  to  as  many  processors  as  possible. 

3.2  Definitions 

We  define  the  parameters  to  be  used  in  the  rest  of  this  paper. 

$I$ : the total number of input chunks required by the given query.
$U$ : the total number of output chunks to be computed by the given query.
$i$ : the size of an input chunk.
$u$ : the size of an output chunk.
$P$ : the number of processors.
$M$ : the available memory size of a processor.
$\alpha$ : the number of output chunks an input chunk maps to.
$\beta$ : the number of input chunks that map to an output chunk.
$d$ : the number of dimensions of the output attribute space.
$x_j$, where $j = 0, 1, \ldots, d-1$ : the extent of an output tile along dimension $j$ of the output attribute space.
$y_j$, where $j = 0, 1, \ldots, d-1$ : the extent of an input chunk along dimension $j$ of the output attribute space.
$z_j$, where $j = 0, 1, \ldots, d-1$ : the extent of an output chunk along dimension $j$ of the output attribute space.
$A_s$ : the average number of input chunks that map to an output tile under strategy $s$.
$B_s$ : the average number of output chunks assigned to an output tile under strategy $s$.
$T_s$ : the total number of output tiles under strategy $s$.

The values of many parameters can be obtained by accessing the index of the input and output datasets. In
practice, averages are used when single values cannot be assumed for $i$, $u$, $\alpha$, $\beta$, $y_j$ and $z_j$.

Figure 3: (a) An example of a 2-dimensional output dataset partitioned into 3x3 output tiles, and input
chunks i1, i2 and i3, whose mid-points fall inside output tile t. Input chunks i1, i2 and i3 map to one, two and
four output tiles, respectively. (b) An alternative way to partition an output tile, based on the mid-points
of the input chunks. $x_j$ and $y_j$, for $j \in \{0, 1\}$, are the extents of an output tile and an input chunk along
dimension $j$, respectively.

3.3  Partitioning  MBR  of  An  Output  Tile  Into  Regions 

For analysis purposes, the MBR of an output tile is partitioned into several regions. Figure 3(a) shows an
example of a 2-dimensional output dataset partitioned into 3x3 output tiles with three input chunks. The
MBRs of the output tiles are shown as white rectangles, while the MBRs of the input chunks are shown as
shaded rectangles. In this example, input chunk i1 maps to one output tile, i2 maps to two output tiles,
and i3 maps to four output tiles. Consider all the input chunks in $I$ whose mid-points fall inside an
output tile, such as i1, i2 and i3 of Figure 3(a) falling inside output tile $t$, and group those input chunks by
the number of output tiles they map to. Assuming that an input chunk has a smaller extent than that of an
output tile, the grouping of the input chunks implies a partitioning of the MBR of output tile $t$ into three
regions, $R_1$, $R_2$ and $R_4$, such that all input chunks whose mid-points fall inside region $R_j$ map to exactly $j$
output tiles.

An alternative way to partition the MBR of output tile $t$ is by considering all the input chunks in $I$
whose top-right corners fall inside $t$. Figure 4(a) shows an example of a 2-dimensional output dataset
partitioned into 3x3 output tiles, with three input chunks whose MBRs are shown as shaded rectangles.
In this example, input chunk i1 maps to one output tile, i2 maps to two output tiles, with two along the first
dimension (i.e., the horizontal dimension) and one along the second dimension (i.e., the vertical dimension),
and i3 maps to four output tiles, with two along each of the two dimensions. Consider all the input
chunks in $I$ whose top-right corners fall inside an output tile, such as i1, i2 and i3 of Figure 4(a) falling
inside output tile $t$, and group those input chunks by the number of output tiles they map to along each
dimension. Assuming that an input chunk has a smaller extent than that of an output tile, the grouping of the
input chunks implies a partitioning of output tile $t$ into four regions, $\mathcal{R}_{\langle 1,1 \rangle}$, $\mathcal{R}_{\langle 1,2 \rangle}$, $\mathcal{R}_{\langle 2,1 \rangle}$ and $\mathcal{R}_{\langle 2,2 \rangle}$,
such that all input chunks whose top-right corners fall inside region $\mathcal{R}_{\langle j,k \rangle}$ map to exactly $j \times k$ output
tiles: $j$ output tiles along dimension 0 and $k$ output tiles along dimension 1. These regions are shown in
Figure 4(b). For example, in Figure 4(a), input chunk i2 belongs to region $\mathcal{R}_{\langle 2,1 \rangle}$, while i3 belongs to
region $\mathcal{R}_{\langle 2,2 \rangle}$. In general, a d-dimensional output tile can be partitioned into $2^d$ regions, each of which is
labeled as $\mathcal{R}_{\langle v_0, v_1, \ldots, v_{d-1} \rangle}$, and an input chunk whose top-right corner falls inside region $\mathcal{R}_{\langle v_0, v_1, \ldots, v_{d-1} \rangle}$ maps to $v_j$
output tiles along dimension $j$, for $j = 0, 1, \ldots, d-1$. Since we assume that the input chunks are randomly
distributed in the output space, the ratio between the volume of region $\mathcal{R}_{\langle v_0, v_1, \ldots, v_{d-1} \rangle}$ and the total volume
of an output tile can be used as an estimate for the probability that an input chunk would map to $v_j$ output
tiles along dimension $j$, for $j = 0, 1, \ldots, d-1$. In the rest of this subsection, we derive the volumes for
these regions.

Figure 4: (a) An example of a 2-dimensional output dataset partitioned into 3x3 output tiles, and input
chunks i1, i2 and i3, whose top-right corners fall inside output tile t. Input chunks i1, i2 and i3 map to one,
two and four output tiles, respectively. (b) Partitioning of output tile t into four regions when the extents of
input chunks are smaller than that of an output tile.

Note that region $\mathcal{R}_{\langle 1,1 \rangle}$ in Figure 4(b) corresponds to region $R_1$ in Figure 3, $\mathcal{R}_{\langle 1,2 \rangle}$ and $\mathcal{R}_{\langle 2,1 \rangle}$
together correspond to $R_2$, and $\mathcal{R}_{\langle 2,2 \rangle}$ corresponds to $R_4$. Although the two approaches of partitioning an
output tile generate different regions, each pair of corresponding regions from the two approaches has the
same volume. Since the approach based on the top-right corners of input chunks extends more naturally
to the scenario where input chunks have larger extents than that of an output tile, we will refer to regions
generated by this approach in the remainder of this paper.

Assume that the d-dimensional output grid is partitioned regularly into rectangular output tiles and there
are $B$ output chunks per output tile. Let each output chunk have a minimum bounding rectangle (MBR)
of size $z_j$ along dimension $j = 0, 1, \ldots, d-1$. Then, the extent of the MBR for an output tile in each
dimension can be computed as $x_j = n_j z_j$ for $j = 0, 1, \ldots, d-1$, where $n_j$ is the number of output chunks
along dimension $j$ of the output tile. In our analysis, we will assume that $n_0 = n_1 = \cdots = n_{d-1}$, and
therefore we have $n_j = \sqrt[d]{B}$. We now derive the volumes of the regions described earlier. We first consider
the scenario where the input chunks have smaller extent than that of an output tile, and later consider the
scenario where the input chunks have larger extent than that of an output tile.
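
As a small worked example of the tile extent computation, the sketch below evaluates $x_j = n_j z_j$ under the assumption $n_0 = n_1 = \cdots = n_{d-1}$. The function name and the sample values are hypothetical, and the sketch additionally assumes that $B$ is a perfect $d$-th power so that $n_j$ is an integer.

```python
def tile_extents(B, z):
    """Extents x_j = n_j * z_j of an output tile holding B output chunks.

    Assumes the same number of chunks n along every dimension, i.e. n ** d == B.
    """
    d = len(z)
    n = round(B ** (1.0 / d))
    if n ** d != B:
        raise ValueError("B must be a perfect d-th power in this sketch")
    return [n * zj for zj in z]


# 16 output chunks per tile in 2 dimensions gives n_j = 4 chunks per dimension.
print(tile_extents(16, [2.0, 3.0]))  # [8.0, 12.0]
```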

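Small Input Chunks: (i.e., $x_j \ge y_j$, $\forall j = 0, 1, \ldots, d-1$)

$d = 2$: (see Figure 4(b))

\[
\begin{aligned}
\mathrm{vol}(\mathcal{R}_{\langle 1,1 \rangle}) &= (x_0 - y_0)(x_1 - y_1) \\
\mathrm{vol}(\mathcal{R}_{\langle 1,2 \rangle}) &= (x_0 - y_0)\, y_1 \\
\mathrm{vol}(\mathcal{R}_{\langle 2,1 \rangle}) &= y_0 (x_1 - y_1) \\
\mathrm{vol}(\mathcal{R}_{\langle 2,2 \rangle}) &= y_0 y_1
\end{aligned}
\]

$d = 3$:

\[
\begin{aligned}
\mathrm{vol}(\mathcal{R}_{\langle 1,1,1 \rangle}) &= (x_0 - y_0)(x_1 - y_1)(x_2 - y_2) \\
\mathrm{vol}(\mathcal{R}_{\langle 1,1,2 \rangle}) &= (x_0 - y_0)(x_1 - y_1)\, y_2 \\
\mathrm{vol}(\mathcal{R}_{\langle 1,2,1 \rangle}) &= (x_0 - y_0)\, y_1 (x_2 - y_2) \\
\mathrm{vol}(\mathcal{R}_{\langle 1,2,2 \rangle}) &= (x_0 - y_0)\, y_1 y_2 \\
\mathrm{vol}(\mathcal{R}_{\langle 2,1,1 \rangle}) &= y_0 (x_1 - y_1)(x_2 - y_2) \\
\mathrm{vol}(\mathcal{R}_{\langle 2,1,2 \rangle}) &= y_0 (x_1 - y_1)\, y_2 \\
\mathrm{vol}(\mathcal{R}_{\langle 2,2,1 \rangle}) &= y_0 y_1 (x_2 - y_2) \\
\mathrm{vol}(\mathcal{R}_{\langle 2,2,2 \rangle}) &= y_0 y_1 y_2
\end{aligned}
\]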

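Let function $\Gamma(s, t, j, S)$ be defined as follows.

\[
\Gamma(s, t, j, S) =
\begin{cases}
s & \text{if } j \in S \\
t & \text{otherwise}
\end{cases}
\]

where $s$ and $t$ are scalars, $j$ is an integer and $S$ is a set of integers. That is, for a given set of integers $S$,
$\Gamma(s, t, j, S)$ returns $s$ if $j$ belongs to set $S$; otherwise, it returns $t$. In general, region $\mathcal{R}_{\langle v_0, v_1, \ldots, v_{d-1} \rangle}$ is a
d-dimensional hyper-box. Since the input chunks are assumed to have smaller extents than that of an output
tile, an input chunk can only map to one or two output tiles along any particular dimension. Hence, we have
$v_j \in \{1, 2\}$. As suggested by Figure 4(a) when $d = 2$, the extent of region $\mathcal{R}_{\langle v_0, v_1, \ldots, v_{d-1} \rangle}$ along dimension
$j$ is either $x_j - y_j$ when $v_j = 1$, or $y_j$ when $v_j = 2$. Therefore, the volume of region $\mathcal{R}_{\langle v_0, v_1, \ldots, v_{d-1} \rangle}$ can
be computed as follows.

\[
\mathrm{vol}(\mathcal{R}_{\langle v_0, v_1, \ldots, v_{d-1} \rangle}) = \prod_{j=0}^{d-1} \Gamma(x_j - y_j,\; y_j,\; v_j,\; \{1\}) \tag{1}
\]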


Note that there are $2^d$ regions in total.
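
Equation (1) can be checked numerically with a short sketch. The tile and chunk extents below are hypothetical, and `gamma` is a direct transcription of the selector function defined above. The check that the region volumes sum to the tile volume follows from the regions forming a partition of the tile.

```python
from itertools import product
from math import isclose, prod


def gamma(s, t, j, S):
    """Return s if j belongs to the set S, otherwise t."""
    return s if j in S else t


def region_volume_small(v, x, y):
    """Equation (1): volume of region R_<v> when every y_j <= x_j."""
    return prod(gamma(xj - yj, yj, vj, {1}) for vj, xj, yj in zip(v, x, y))


# Hypothetical 2-dimensional example: tile extents x, input chunk extents y.
x, y = [10.0, 8.0], [4.0, 2.0]
volumes = {v: region_volume_small(v, x, y) for v in product((1, 2), repeat=len(x))}
for v, vol in volumes.items():
    print(v, vol)

# The 2^d regions partition the tile, so their volumes sum to the tile volume.
assert isclose(sum(volumes.values()), prod(x))
```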

Large Input Chunks: (i.e., $\exists j \in [0, d-1]$ such that $x_j < y_j$)

Let $r_j$ be the smallest number of output tiles along dimension $j$ in the output attribute space that can entirely
contain the MBR of an input chunk along that dimension. That is,

\[
r_j = \left\lceil \frac{y_j}{x_j} \right\rceil \quad \text{for } j = 0, 1, 2, \ldots, d-1
\]

Note that this implies the following, for $j = 0, 1, 2, \ldots, d-1$.

\[
(r_j - 1)\, x_j < y_j \le r_j x_j
\]

Figure 5: (a) An example of an output dataset partitioned into 6x4 output tiles, and input chunks i1 and
i2, whose extents map to six and eight output tiles, respectively. (b) Partitioning of output tile t into four
regions when the extents of input chunks are larger than that of an output tile.


Figure 5(a) shows the output dataset partitioned into 6x4 output tiles, and two input chunks with extents
larger than that of an output tile mapping to six and eight output tiles, respectively. In this example, we
have $r_0 = 3$ and $r_1 = 2$. Figure 5(b) shows how output tile $t$ is partitioned into four regions, using
the same grouping of input chunks whose top-right corners fall inside tile $t$ that was used earlier.

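We now compute the volume for region $\mathcal{R}_{\langle v_0, v_1, \ldots, v_{d-1} \rangle}$. Let us first consider the case where the output
space is 2-dimensional.

\[
\begin{aligned}
\mathrm{vol}(\mathcal{R}_{\langle r_0, r_1 \rangle}) &= (r_0 x_0 - y_0)(r_1 x_1 - y_1) \\
\mathrm{vol}(\mathcal{R}_{\langle r_0+1, r_1 \rangle}) &= [y_0 - (r_0 - 1) x_0](r_1 x_1 - y_1) \\
\mathrm{vol}(\mathcal{R}_{\langle r_0, r_1+1 \rangle}) &= (r_0 x_0 - y_0)[y_1 - (r_1 - 1) x_1] \\
\mathrm{vol}(\mathcal{R}_{\langle r_0+1, r_1+1 \rangle}) &= [y_0 - (r_0 - 1) x_0][y_1 - (r_1 - 1) x_1]
\end{aligned}
\]

Extending the analysis carried out for a 2-dimensional attribute space to a d-dimensional output space, we
have the following.

\[
\begin{aligned}
\mathrm{vol}(\mathcal{R}_{\langle r_0, r_1, \ldots, r_{d-1} \rangle}) &= (r_0 x_0 - y_0)(r_1 x_1 - y_1) \cdots (r_{d-1} x_{d-1} - y_{d-1}) \\
\mathrm{vol}(\mathcal{R}_{\langle r_0+1, r_1, \ldots, r_{d-1} \rangle}) &= [y_0 - (r_0 - 1) x_0](r_1 x_1 - y_1) \cdots (r_{d-1} x_{d-1} - y_{d-1}) \\
\mathrm{vol}(\mathcal{R}_{\langle r_0, \ldots, r_k+1, \ldots, r_{d-1} \rangle}) &= (r_0 x_0 - y_0)(r_1 x_1 - y_1) \cdots (r_{k-1} x_{k-1} - y_{k-1}) \\
&\quad\; [y_k - (r_k - 1) x_k](r_{k+1} x_{k+1} - y_{k+1}) \cdots (r_{d-1} x_{d-1} - y_{d-1}) \\
\mathrm{vol}(\mathcal{R}_{\langle r_0, \ldots, r_k+1, \ldots, r_j+1, \ldots, r_{d-1} \rangle}) &= (r_0 x_0 - y_0)(r_1 x_1 - y_1) \cdots (r_{k-1} x_{k-1} - y_{k-1}) \\
&\quad\; [y_k - (r_k - 1) x_k](r_{k+1} x_{k+1} - y_{k+1}) \cdots (r_{j-1} x_{j-1} - y_{j-1}) \\
&\quad\; [y_j - (r_j - 1) x_j](r_{j+1} x_{j+1} - y_{j+1}) \cdots (r_{d-1} x_{d-1} - y_{d-1}) \\
&\;\;\vdots \\
\mathrm{vol}(\mathcal{R}_{\langle r_0+1, r_1+1, \ldots, r_{d-1}+1 \rangle}) &= [y_0 - (r_0 - 1) x_0][y_1 - (r_1 - 1) x_1] \cdots [y_{d-1} - (r_{d-1} - 1) x_{d-1}]
\end{aligned}
\]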

Note that an input chunk can only map to $r_j$ or $r_j + 1$ output tiles along a particular dimension. Therefore, we have $v_j \in \{r_j, r_j + 1\}$, and there are $2^d$ regions in total. Let $\Psi$ be the set of regions from partitioning output tile t.

In general, the extent of region $\mathcal{R}_{\langle v_0, v_1, \ldots, v_{d-1} \rangle}$ along dimension $j$ is either $r_j x_j - y_j$ when $v_j = r_j$, or $y_j - (r_j - 1)x_j$ when $v_j = r_j + 1$. The volume of region $\mathcal{R}_{\langle v_0, v_1, \ldots, v_{d-1} \rangle}$ can therefore be computed as follows.

$$\mathrm{vol}(\mathcal{R}_{\langle v_0, v_1, \ldots, v_{d-1} \rangle}) = \prod_{j=0}^{d-1} \begin{cases} r_j x_j - y_j & \text{if } v_j = r_j \\ y_j - (r_j - 1)x_j & \text{if } v_j = r_j + 1 \end{cases} \qquad (2)$$

Note that with $r_0 = r_1 = \cdots = r_{d-1} = 1$, Equation (2) becomes Equation (1). Therefore, in the rest of this paper, we will use Equation (2) to compute the volume of a region for both scenarios.
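The region volumes of Equation (2) can be checked numerically. The following is a minimal sketch, not the ADR implementation; the function names and the sample extents (chosen so that $r_0 = 3$ and $r_1 = 2$, as in the Figure 5 example) are assumptions made for illustration. It also confirms that the $2^d$ region volumes add up to the volume of one output tile.

```python
from itertools import product
from math import ceil, prod

def tiles_spanned(x, y):
    """r_j = ceil(y_j / x_j): tiles needed to contain a chunk MBR per dimension."""
    return [ceil(yj / xj) for xj, yj in zip(x, y)]

def region_volume(x, y, v):
    """Equation (2): product over dimensions of the region extent selected by v_j."""
    r = tiles_spanned(x, y)
    extents = []
    for xj, yj, rj, vj in zip(x, y, r, v):
        if vj == rj:
            extents.append(rj * xj - yj)
        elif vj == rj + 1:
            extents.append(yj - (rj - 1) * xj)
        else:
            raise ValueError("v_j must be r_j or r_j + 1")
    return prod(extents)

def all_regions(x, y):
    """Enumerate the 2^d index vectors <v_0, ..., v_{d-1}>."""
    r = tiles_spanned(x, y)
    return list(product(*[(rj, rj + 1) for rj in r]))

# Hypothetical tile extents x and chunk MBR extents y (same units).
x = [4.0, 5.0]
y = [9.0, 7.0]
volumes = {v: region_volume(x, y, v) for v in all_regions(x, y)}
total = sum(volumes.values())  # equals prod(x), the volume of one output tile
```

With these sample values the four regions have volumes 9, 3, 6 and 2, which sum to the tile volume of 20.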

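As discussed earlier, the ratio between the volume of region $\mathcal{R}_{\langle v_0, v_1, \ldots, v_{d-1} \rangle}$ and the total volume of an output tile can be used as an estimate for the probability that an input chunk would map to $v_j$ output tiles along dimension $j$. By the definition of $\mathcal{R}_{\langle v_0, v_1, \ldots, v_{d-1} \rangle}$, an input chunk that belongs to region $\mathcal{R}_{\langle v_0, v_1, \ldots, v_{d-1} \rangle}$ maps to $\prod_{j=0}^{d-1} v_j$ output tiles. Therefore, the expected number of output tiles that an input chunk maps to, $\lambda$, can be computed as follows.

$$\begin{aligned}
\lambda &= \sum_{\mathcal{R}_v \in \Psi} \frac{\mathrm{vol}(\mathcal{R}_v)}{\mathrm{vol}(\text{an output tile})}\,(v_0 v_1 \cdots v_{d-1}) \\
&= \frac{1}{x_0 x_1 \cdots x_{d-1}} \sum_{\mathcal{R}_v \in \Psi} v_0 v_1 \cdots v_{d-1} \prod_{j=0}^{d-1} \begin{cases} r_j x_j - y_j & \text{if } v_j = r_j \\ y_j - (r_j - 1)x_j & \text{if } v_j = r_j + 1 \end{cases} \qquad (3)
\end{aligned}$$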
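Equation (3) can be evaluated by direct enumeration of the $2^d$ regions. The sketch below is illustrative only; the function names and sample extents are assumptions. It also compares the enumeration against a closed form that follows from expanding the sum one dimension at a time: since $r_j(r_j x_j - y_j) + (r_j + 1)[y_j - (r_j - 1)x_j] = x_j + y_j$, the sum factors into $\prod_j (1 + y_j / x_j)$.

```python
from itertools import product
from math import ceil, prod

def expected_tiles(x, y):
    """Equation (3): sum over regions of (region volume / tile volume) * tiles covered."""
    r = [ceil(yj / xj) for xj, yj in zip(x, y)]
    tile_volume = prod(x)
    lam = 0.0
    for v in product(*[(rj, rj + 1) for rj in r]):
        vol = prod(
            rj * xj - yj if vj == rj else yj - (rj - 1) * xj
            for xj, yj, rj, vj in zip(x, y, r, v)
        )
        lam += (vol / tile_volume) * prod(v)
    return lam

def expected_tiles_closed_form(x, y):
    """Per-dimension factorization of Equation (3): prod_j (1 + y_j / x_j)."""
    return prod(1 + yj / xj for xj, yj in zip(x, y))

# Hypothetical extents: chunks larger than a tile, then chunks smaller than a tile.
lam_large = expected_tiles([4.0, 5.0], [9.0, 7.0])
lam_small = expected_tiles([10.0, 10.0], [2.0, 5.0])
```

The factorized form is convenient when $d$ is large, because it avoids enumerating all $2^d$ regions.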


4  Analytical  Cost  Models 

In this section we present analytical models to predict the average operation counts of the three query processing strategies. In particular, our models estimate the number of I/O, communication, and computation operations that must be performed by a processor for an output tile in each of the query processing phases (initialization, local reduction, global combine and output handling).

Table 1 shows the expected average number of operations per processor for a tile in each phase. In the remainder of this section, we describe the methods used to compute the expected number of operations. The main assumptions of the analytical models described in this paper are that the input chunks are uniformly distributed in the output attribute space, and that the output dataset is a regular d-dimensional dense array.

4.1  Computing  Operation  Counts  for  FRA 

The number of output tiles and the average number of output chunks in an output tile depend on the aggregate system memory that can be effectively utilized by a query processing strategy. Since an output chunk is replicated across all processors for FRA, the effective system memory for FRA is the size of memory on a single processor, $M$. Hence, the average number of output chunks per output tile, $B_{fra}$, and the number of output tiles, $T_{fra}$, can be computed as follows.

$$B_{fra} = \frac{M}{u} \qquad\qquad T_{fra} = \frac{U}{B_{fra}} = \frac{Uu}{M}$$

| Query Execution Phase | FRA I/O | FRA Comm. | FRA Comp. | SRA I/O | SRA Comm. | SRA Comp. | DA I/O | DA Comm. | DA Comp. |
|---|---|---|---|---|---|---|---|---|---|
| Initialization | $B_{fra}/P$ | $(B_{fra}/P)(P-1)$ | $B_{fra}$ | $B_{sra}/P$ | $g$ | $B_{sra}/P + g$ | $B_{da}/P$ | 0 | $B_{da}/P$ |
| Local Reduction | $A_{fra}/P$ | 0 | $\beta B_{fra}/P$ | $A_{sra}/P$ | 0 | $\beta B_{sra}/P$ | $A_{da}/P$ | $I_{msg}$ | $\beta B_{da}/P$ |
| Global Combine | 0 | $(B_{fra}/P)(P-1)$ | $(B_{fra}/P)(P-1)$ | 0 | $g$ | $g$ | 0 | 0 | 0 |
| Output Handling | $B_{fra}/P$ | 0 | $B_{fra}/P$ | $B_{sra}/P$ | 0 | $B_{sra}/P$ | $B_{da}/P$ | 0 | $B_{da}/P$ |

Table 1: The expected average number of I/O, communication, and computation operations per processor for a tile in each phase. $B_{fra}$, $B_{sra}$, and $B_{da}$ denote the expected average number of output chunks per tile for the FRA, SRA, and DA strategies, respectively. Similarly, $A_{fra}$, $A_{sra}$, and $A_{da}$ are the expected average number of input chunks retrieved per tile for the FRA, SRA, and DA strategies. $g$ is the expected average number of ghost chunks per processor for a tile in SRA, and $I_{msg}$ is the expected average number of messages per processor for input chunks in a tile for DA. The average number of output chunks that an input chunk maps to is denoted by $\alpha$, and $\beta$ represents the average number of input chunks that map to an output chunk. $P$ is the number of processors executing the query.

The expected extent of an output tile along dimension $j$ is computed as $x_j = \sqrt[d]{B_{fra}}\, z_j$, for $j = 0, 1, \ldots, d-1$. Equation (3) (see Section 3.3) computes the expected number of output tiles that an input chunk intersects. Therefore, the expected number of input chunks out of $I$ input chunks that map to a given output tile can be computed as follows.

$$A_{fra} = \frac{\lambda I}{T_{fra}}$$

Assuming perfect declustering, each processor reads $B_{fra}/P$ output chunks during the initialization phase, and $A_{fra}/P$ input chunks during the local reduction phase. Each output chunk is sent to $P - 1$ processors, therefore each processor sends out $(B_{fra}/P)(P - 1)$ output chunks during the initialization phase and $(B_{fra}/P)(P - 1)$ output chunks during the global combine phase. Since each output chunk is mapped to by $\beta$ input chunks, $B_{fra}\beta$ computation operations are carried out in total for an output tile of $B_{fra}$ output chunks. Assuming perfect declustering of the input chunks across all processors, each processor is responsible for $B_{fra}\beta / P$ computation operations per output tile.
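The FRA counts described above can be put together in a few lines. This is a minimal sketch under stated assumptions: the function name, the dictionary keys, and all sample parameter values are hypothetical, $\lambda$ is taken as a given input (it would come from Equation (3)), and only the quantities stated in the text of this section are computed.

```python
def fra_counts(M, u, U, I, lam, beta, P):
    """Per-processor, per-tile FRA operation counts stated in Section 4.1.

    M: memory per processor, u: output chunk size (same unit as M),
    U: number of output chunks, I: number of input chunks,
    lam: expected tiles per input chunk, beta: input chunks per output chunk,
    P: number of processors.
    """
    B = M / u          # output chunks per tile (each chunk replicated everywhere)
    T = U / B          # number of output tiles
    A = lam * I / T    # input chunks that map to one tile
    return {
        "B_fra": B,
        "T_fra": T,
        "A_fra": A,
        "init_reads": B / P,               # output chunks read in initialization
        "local_reads": A / P,              # input chunks read in local reduction
        "init_sends": (B / P) * (P - 1),   # output chunks sent in initialization
        "combine_sends": (B / P) * (P - 1),  # output chunks sent in global combine
        "local_comp": B * beta / P,        # computation in local reduction
    }

# Hypothetical configuration: 100 MB per processor, 0.25 MB output chunks.
counts = fra_counts(M=100.0, u=0.25, U=1600, I=4800, lam=1.8, beta=4, P=8)
```

For this configuration the model gives 400 output chunks per tile and 4 tiles, so each of the 8 processors reads 50 output chunks and sends 350 during initialization.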
4.2  Computing  Operation  Counts  for  SRA

Let $e$ be the average fraction of system memory used for local output chunks in an output tile. That is, if $g$ is the average number of ghost chunks per processor per output tile, we have

$$e = \frac{b_{sra}}{b_{sra} + g}$$

where $b_{sra}$ is the average number of local output chunks per processor per output tile. Note that we have $\frac{1}{P} \le e \le 1$. When $e$ is equal to $\frac{1}{P}$, SRA is equivalent to FRA and $g = b_{sra}(P - 1)$. Given the value of $e$, we have the following.

$$B_{sra} = \frac{ePM}{u} \qquad\qquad b_{sra} = \frac{B_{sra}}{P} = \frac{eM}{u} \qquad\qquad T_{sra} = \frac{U}{B_{sra}} = \frac{Uu}{ePM}$$

We compute $g$ and $e$ as follows. The goal of the declustering algorithms used in ADR [5, 9] is to achieve good I/O parallelism when retrieving input and output chunks from disks. To achieve this goal, the algorithms distribute spatially close chunks evenly across as many processors as possible. Therefore, the $\beta$ input chunks that map to an output chunk on processor $p$ can be expected to be distributed across as many processors as possible. Let $g'$ be the average number of ghost chunks that are created for an output chunk. Then, with $b_{sra}$ local output chunks per processor in an output tile, on average a processor creates a total of $g = b_{sra}\,g'$ ghost chunks per output tile, and $P$ processors create $P\,b_{sra}\,g'$ ghost chunks per output tile.

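Under the assumption that input chunks that map to the same output chunk are distributed across as many processors as possible, SRA becomes FRA if $\beta \ge P$. When $\beta < P$, we have

$$\begin{aligned}
g' &= \mathrm{prob}\{p \text{ is one of } \beta \text{ procs}\}\,(\beta - 1) + \mathrm{prob}\{p \text{ is not one of } \beta \text{ procs}\}\,\beta \\
&= \frac{\beta}{P}(\beta - 1) + \left(1 - \frac{\beta}{P}\right)\beta
\end{aligned}$$

And hence,

$$\begin{aligned}
e &= \frac{b_{sra}}{b_{sra} + g} = \frac{b_{sra}}{b_{sra} + b_{sra}\,g'} = \frac{1}{1 + g'} = \frac{P}{P + \beta(P - 1)} \\
g &= b_{sra}\,g' = b_{sra}\,\beta\,\frac{P - 1}{P}
\end{aligned}$$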
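The relationship between $\beta$, $P$, $g'$ and $e$ can be sketched as follows. This is an illustrative sketch rather than ADR code; the function names and sample values are assumptions, and the $\beta \ge P$ branch simply encodes the statement that SRA then behaves like FRA, with $g' = P - 1$ and $e = 1/P$.

```python
def sra_ghost_model(beta, P):
    """Return (g_prime, e): ghost chunks per output chunk and local memory fraction."""
    if beta >= P:
        # SRA degenerates to FRA: every other processor holds a ghost copy.
        g_prime = P - 1
    else:
        # p owns one of the beta input chunks with probability beta / P.
        g_prime = (beta / P) * (beta - 1) + (1 - beta / P) * beta
    e = 1 / (1 + g_prime)
    return g_prime, e

def sra_tile_sizes(M, u, U, beta, P):
    """Return (B_sra, b_sra, T_sra, g) for the SRA strategy."""
    g_prime, e = sra_ghost_model(beta, P)
    B = e * P * M / u      # output chunks per tile
    b = B / P              # local output chunks per processor per tile
    T = U / B              # number of output tiles
    g = b * g_prime        # ghost chunks per processor per tile
    return B, b, T, g

# Hypothetical values: beta = 4 input chunks per output chunk on P = 8 processors.
g_prime, e = sra_ghost_model(4, 8)
```

For $\beta = 4$ and $P = 8$ this gives $g' = 3.5$ and $e = 8/36$, matching the closed forms $\beta(P-1)/P$ and $P/(P + \beta(P-1))$.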


Similar to FRA, the expected extent of an output tile along dimension $j$ is computed as $x_j = \sqrt[d]{B_{sra}}\, z_j$, for $j = 0, 1, \ldots, d-1$, and $\lambda$ is computed by Equation (3). The expected number of input chunks out of $I$ input chunks that intersect with a given output tile therefore can be computed as follows.

$$A_{sra} = \frac{\lambda I}{T_{sra}}$$

Assuming perfect declustering, each processor reads $b_{sra} = B_{sra}/P$ output chunks during the initialization phase, and $A_{sra}/P$ input chunks during the local reduction phase. Each processor sends $g$ messages for output chunks during the initialization phase, and $g$ messages for output chunks during the global combine phase. Similar to FRA, since each output chunk is mapped to by $\beta$ input chunks, $B_{sra}\beta$ computation operations are carried out in total for an output tile of $B_{sra}$ output chunks. Assuming perfect declustering of the input chunks across all processors, each processor is responsible for $B_{sra}\beta / P$ computation operations per output tile.


4.3  Computing  Operation  Counts  for  DA 

For DA, the output chunks are not replicated, so the effective overall system memory is $P \times M$. Therefore, the average number of output chunks and the number of output tiles can be computed as follows.

$$B_{da} = \frac{PM}{u} \qquad\qquad T_{da} = \frac{U}{B_{da}} = \frac{Uu}{PM}$$

Similar to FRA, the expected extent of an output tile along dimension $j$, $x_j$ for $j = 0, 1, \ldots, d-1$, and the expected number of input chunks out of $I$ input chunks that map to a given output tile, $A_{da}$, can be computed as follows.

$$x_j = \sqrt[d]{B_{da}}\, z_j \quad \text{for } j = 0, 1, \ldots, d-1 \qquad\qquad A_{da} = \frac{\lambda I}{T_{da}}$$
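The three strategies differ only in the effective memory they can devote to distinct output chunks, which the following sketch makes explicit. It is an illustration under assumed names and sample values, not part of ADR; the SRA fraction `e` is passed in as a parameter.

```python
def tiling(strategy, M, u, U, P, e=None):
    """Return (B, T): output chunks per tile and number of tiles for a strategy."""
    if strategy == "FRA":
        effective_memory = M            # every chunk replicated on all processors
    elif strategy == "SRA":
        if e is None:
            raise ValueError("SRA needs the local-memory fraction e")
        effective_memory = e * P * M    # only a fraction e holds local chunks
    elif strategy == "DA":
        effective_memory = P * M        # no replication at all
    else:
        raise ValueError("unknown strategy")
    B = effective_memory / u
    return B, U / B

def tile_extents(B, z):
    """x_j = B^(1/d) * z_j, with z the per-dimension factors z_j from the text."""
    d = len(z)
    return [B ** (1.0 / d) * zj for zj in z]

def input_chunks_per_tile(lam, I, T):
    """A = lambda * I / T for any of the three strategies."""
    return lam * I / T

# Hypothetical machine: 8 processors with 25 MB each, 0.25 MB output chunks.
fra = tiling("FRA", 25.0, 0.25, 1600, 8)
sra = tiling("SRA", 25.0, 0.25, 1600, 8, e=0.25)
da = tiling("DA", 25.0, 0.25, 1600, 8)
```

With these sample values FRA needs 16 tiles, SRA 8, and DA 2, which illustrates why a replicating strategy retrieves input chunks more often.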


During the local reduction phase for DA, local input chunks that map to output chunks on other processors must be sent to those processors. As a result, DA requires interprocessor communication for input chunks. We estimate the number of messages for input chunks for each processor, $I_{msg}$, as follows.

1. Compute for each region $\mathcal{R}_v \in \Psi$ of an output tile (see Section 3.3) the expected number of messages generated for an input chunk that belongs to $\mathcal{R}_v$. This is denoted as $I_v$.

2. Compute $I_{msg}$ from the sum of all $I_v$'s, weighted by the volumes of the regions that correspond to the $I_v$'s:

$$\sum_{\mathcal{R}_v \in \Psi} \frac{\mathrm{vol}(\mathcal{R}_v)}{\mathrm{vol}(\text{an output tile})}\, I_v$$

Equation (2) in Section 3.3 computes the volume of region $\mathcal{R}_v$. We now compute $I_v$, the expected number of messages for an input chunk in region $\mathcal{R}_v \in \Psi$. First, let $C$ be defined as follows.

$$C(k, P) = \begin{cases} P - 1 & \text{if } k \ge P \\ (k - 1)\frac{k}{P} + k\left(1 - \frac{k}{P}\right) & \text{otherwise} \end{cases} \;=\; \begin{cases} P - 1 & \text{if } k \ge P \\ \frac{k(P - 1)}{P} & \text{otherwise} \end{cases}$$

For an input chunk that maps to $k$ output chunks in an output tile, under perfect declustering, those $k$ output chunks are expected to be distributed to as many processors as possible. That is, if $k \ge P$, then the $k$ mapped output chunks are stored on $P$ processors; otherwise, they are stored on $k$ processors. $C(k, P)$ therefore gives the expected number of remote processors that own at least one of the $k$ mapped output chunks. Note that this is also the expected number of messages that are generated in DA for an input chunk that maps to $k$ output chunks in an output tile.
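The function $C(k, P)$ is small enough to state directly in code. The sketch below uses a hypothetical function name and evaluates the probabilistic form; the values agree with the simplified form $k(P-1)/P$ for $k < P$. Note that $k$ need not be an integer, since later formulas pass fractions of $\alpha$.

```python
def remote_owners(k, P):
    """C(k, P): expected number of remote processors owning the k mapped output chunks."""
    if k >= P:
        # The k chunks cover all P processors, so all P - 1 remote ones get a message.
        return P - 1
    # With probability k / P the local processor owns one of the k chunks,
    # leaving k - 1 remote owners; otherwise all k owners are remote.
    return (k - 1) * (k / P) + k * (1 - k / P)

# Hypothetical values on P = 8 processors.
few = remote_owners(4, 8)     # k < P
many = remote_owners(16, 8)   # k >= P
```

For $P = 8$, a chunk mapping to 4 output chunks is expected to generate 3.5 messages, while one mapping to 16 output chunks generates 7.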
We now continue the computation of $I_v$ for region $\mathcal{R}_v$. First, consider the scenario where input chunks have extents smaller than that of an output tile. See Figure 4(b) for the four regions of an output tile in this scenario. To simplify the analysis, for each of the four regions, we designate an input chunk as a representative input chunk for all input chunks in the same region, and use the expected number of messages that are generated by DA for the representative input chunk as the expected number of messages generated for each input chunk in that region.

First consider region $\mathcal{R}_{\langle 1,1 \rangle}$ in Figure 4(b). Since all input chunks in region $\mathcal{R}_{\langle 1,1 \rangle}$ map to only one output tile, they are all expected to generate the same number of messages. With each input chunk mapping to $\alpha$ output chunks, we have

$$I_{\langle 1,1 \rangle} = C(\alpha, P)$$

Now let's consider an input chunk $c$ in region $\mathcal{R}_{\langle 1,2 \rangle}$, where $c$ maps to two output tiles, say $t$ and $s$. Under the assumption that the output dataset is a regular 2-dimensional dense array, the number of output chunks that input chunk $c$ maps to in output tile $t$ is proportional to the volume (or area when $d = 2$) of the portion of $c$ that falls inside output tile $t$. Suppose that the volume of the portion of input chunk $c$ contained in output tile $t$ is $\omega_1$, and the volume of the portion of $c$ contained in output tile $s$ is $\omega_2$, and $\mathrm{vol}(c) = \omega_1 + \omega_2$. Then the expected number of messages generated for input chunk $c$ can be computed as the sum of the expected number of messages for $c$ in output tile $t$ and the expected number of messages for $c$ in output tile $s$, which is equal to $C\!\left(\frac{\omega_1}{\mathrm{vol}(c)}\alpha, P\right) + C\!\left(\frac{\omega_2}{\mathrm{vol}(c)}\alpha, P\right)$.

Imagine the top-right corner of input chunk $c$ slides from the top towards the bottom of region $\mathcal{R}_{\langle 1,2 \rangle}$ in output tile $t$. Initially, most of the $\alpha$ output chunks that $c$ maps to belong to $t$. As $c$ slides towards the bottom of region $\mathcal{R}_{\langle 1,2 \rangle}$ in $t$, $c$ overlaps more with $s$, and therefore some of the $\alpha$ output chunks now belong to $s$. When its top-right corner is located at the mid-point of region $\mathcal{R}_{\langle 1,2 \rangle}$, input chunk $c$ is evenly split between $t$ and $s$. As a result, half of the $\alpha$ output chunks that $c$ maps to belong to output tile $t$, and half of them belong to output tile $s$. As its top-right corner moves past the mid-point of region $\mathcal{R}_{\langle 1,2 \rangle}$ towards the bottom of the region in $t$, more and more of the $\alpha$ output chunks that $c$ maps to belong to output tile $s$. Thus, as $c$ slides from the top towards the bottom of region $\mathcal{R}_{\langle 1,2 \rangle}$ in $t$, the expected number of messages for $c$ first increases, peaks at the mid-point of $\mathcal{R}_{\langle 1,2 \rangle}$, and then decreases. To pick a reasonable representative input chunk for region $\mathcal{R}_{\langle 1,2 \rangle}$, we choose the input chunk whose top-right corner is located one-quarter below the top of region $\mathcal{R}_{\langle 1,2 \rangle}$ and one-quarter to the left of the right boundary of region $\mathcal{R}_{\langle 1,2 \rangle}$, though the input chunk whose top-right corner is located one-quarter above the bottom of region $\mathcal{R}_{\langle 1,2 \rangle}$ would serve equally well. Figure 6 shows the top-right corners of the four representative input chunks for the four regions of an output tile. As discussed earlier, all input chunks in region $\mathcal{R}_{\langle 1,1 \rangle}$ generate the same number of messages, and hence any input chunk in that region can be chosen as the representative input chunk. For convenience, we choose the one with its top-right corner located one-quarter from the top and one-quarter from the right boundaries of region $\mathcal{R}_{\langle 1,1 \rangle}$. The top-right corners of the representative input chunks for the other two regions, $\mathcal{R}_{\langle 2,1 \rangle}$ and $\mathcal{R}_{\langle 2,2 \rangle}$, are chosen in a similar way as for region $\mathcal{R}_{\langle 1,2 \rangle}$. The numbers of messages that these four representative input chunks generate are used as the expected numbers of messages for input chunks in the four regions, and they are computed as follows.

$$\begin{aligned}
I_{\langle 1,1 \rangle} &= C(\alpha, P) \\
I_{\langle 1,2 \rangle} &= C\!\left(\frac{y_0 \cdot \frac{3}{4}y_1}{y_0 y_1}\alpha, P\right) + C\!\left(\frac{y_0 \cdot \frac{1}{4}y_1}{y_0 y_1}\alpha, P\right) = C\!\left(\tfrac{3}{4}\alpha, P\right) + C\!\left(\tfrac{1}{4}\alpha, P\right) \\
I_{\langle 2,1 \rangle} &= C\!\left(\frac{\frac{3}{4}y_0 \cdot y_1}{y_0 y_1}\alpha, P\right) + C\!\left(\frac{\frac{1}{4}y_0 \cdot y_1}{y_0 y_1}\alpha, P\right) = C\!\left(\tfrac{3}{4}\alpha, P\right) + C\!\left(\tfrac{1}{4}\alpha, P\right) \\
I_{\langle 2,2 \rangle} &= C\!\left(\frac{\frac{3}{4}y_0 \cdot \frac{3}{4}y_1}{y_0 y_1}\alpha, P\right) + C\!\left(\frac{\frac{3}{4}y_0 \cdot \frac{1}{4}y_1}{y_0 y_1}\alpha, P\right) + C\!\left(\frac{\frac{1}{4}y_0 \cdot \frac{3}{4}y_1}{y_0 y_1}\alpha, P\right) + C\!\left(\frac{\frac{1}{4}y_0 \cdot \frac{1}{4}y_1}{y_0 y_1}\alpha, P\right) \\
&= C\!\left(\tfrac{9}{16}\alpha, P\right) + 2\,C\!\left(\tfrac{3}{16}\alpha, P\right) + C\!\left(\tfrac{1}{16}\alpha, P\right)
\end{aligned}$$

Figure 6: The top-right corners, shown as shaded triangles, of the four representative input chunks for the four regions of an output tile, under the assumption that input chunk extents are smaller than that of an output tile.
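The four expressions for the small-chunk, 2-dimensional case can be evaluated directly. The sketch below is illustrative; the function names and the sample values of $\alpha$ and $P$ are assumptions, and `remote_owners` is a hypothetical name for $C(k, P)$.

```python
def remote_owners(k, P):
    """C(k, P): expected number of remote processors owning k mapped output chunks."""
    if k >= P:
        return P - 1
    return (k - 1) * (k / P) + k * (1 - k / P)

def small_chunk_messages(alpha, P):
    """Expected messages for the representative chunk of each of the four regions."""
    C = remote_owners
    split = C(3 * alpha / 4, P) + C(alpha / 4, P)   # chunk split 3/4 : 1/4 over two tiles
    return {
        (1, 1): C(alpha, P),
        (1, 2): split,
        (2, 1): split,
        (2, 2): C(9 * alpha / 16, P) + 2 * C(3 * alpha / 16, P) + C(alpha / 16, P),
    }

# Hypothetical query: each input chunk maps to alpha = 16 output chunks, P = 8.
msgs = small_chunk_messages(16, 8)
```

With $\alpha = 16$ and $P = 8$, a chunk confined to one tile generates 7 messages, while a chunk spanning four tiles generates 13.125, because messages are counted separately for each output tile.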


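Note that the representative chunks in regions $\mathcal{R}_{\langle 1,1 \rangle}$, $\mathcal{R}_{\langle 1,2 \rangle}$, $\mathcal{R}_{\langle 2,1 \rangle}$ and $\mathcal{R}_{\langle 2,2 \rangle}$ map to one, two, two and four output tiles, respectively. That is why $I_{\langle 1,1 \rangle}$, $I_{\langle 1,2 \rangle}$, $I_{\langle 2,1 \rangle}$ and $I_{\langle 2,2 \rangle}$ are computed as sums of one, two, two and four terms, respectively.

Now consider the scenario where input chunks have extents larger than that of an output tile. Let's start with the 2-dimensional case. Similar to the analysis in the scenario where input chunks have small extents, we choose a representative input chunk for each of the four regions shown in Figure 5(b). The top-right corners of those representative input chunks are chosen the same way they were chosen for the smaller input chunk scenario, and Figure 7 shows the four representative input chunks for an example where $r_0 = 3$ and $r_1 = 1$. The expected numbers of messages for the four representative input chunks for the example in Figure 7 are given below.

$$\begin{aligned}
I_{\langle r_0, r_1 \rangle} ={}& (r_0 - 2)\,C\!\left(\frac{x_0 y_1}{y_0 y_1}\alpha, P\right) + C\!\left(\frac{\frac{y_0 - (r_0 - 4)x_0}{4}\, y_1}{y_0 y_1}\alpha, P\right) + C\!\left(\frac{\frac{3y_0 - (3r_0 - 4)x_0}{4}\, y_1}{y_0 y_1}\alpha, P\right) \\
I_{\langle r_0+1, r_1 \rangle} ={}& (r_0 - 1)\,C\!\left(\frac{x_0 y_1}{y_0 y_1}\alpha, P\right) + C\!\left(\frac{\frac{3}{4}[y_0 - (r_0 - 1)x_0]\, y_1}{y_0 y_1}\alpha, P\right) + C\!\left(\frac{\frac{1}{4}[y_0 - (r_0 - 1)x_0]\, y_1}{y_0 y_1}\alpha, P\right) \\
I_{\langle r_0, r_1+1 \rangle} ={}& (r_0 - 2)\,C\!\left(\frac{x_0 \cdot \frac{3}{4}y_1}{y_0 y_1}\alpha, P\right) + (r_0 - 2)\,C\!\left(\frac{x_0 \cdot \frac{1}{4}y_1}{y_0 y_1}\alpha, P\right) \\
&+ C\!\left(\frac{\frac{y_0 - (r_0 - 4)x_0}{4} \cdot \frac{3}{4}y_1}{y_0 y_1}\alpha, P\right) + C\!\left(\frac{\frac{y_0 - (r_0 - 4)x_0}{4} \cdot \frac{1}{4}y_1}{y_0 y_1}\alpha, P\right) \\
&+ C\!\left(\frac{\frac{3y_0 - (3r_0 - 4)x_0}{4} \cdot \frac{3}{4}y_1}{y_0 y_1}\alpha, P\right) + C\!\left(\frac{\frac{3y_0 - (3r_0 - 4)x_0}{4} \cdot \frac{1}{4}y_1}{y_0 y_1}\alpha, P\right) \\
I_{\langle r_0+1, r_1+1 \rangle} ={}& (r_0 - 1)\,C\!\left(\frac{x_0 \cdot \frac{3}{4}y_1}{y_0 y_1}\alpha, P\right) + (r_0 - 1)\,C\!\left(\frac{x_0 \cdot \frac{1}{4}y_1}{y_0 y_1}\alpha, P\right) \\
&+ C\!\left(\frac{\frac{3}{4}[y_0 - (r_0 - 1)x_0] \cdot \frac{3}{4}y_1}{y_0 y_1}\alpha, P\right) + C\!\left(\frac{\frac{3}{4}[y_0 - (r_0 - 1)x_0] \cdot \frac{1}{4}y_1}{y_0 y_1}\alpha, P\right) \\
&+ C\!\left(\frac{\frac{1}{4}[y_0 - (r_0 - 1)x_0] \cdot \frac{3}{4}y_1}{y_0 y_1}\alpha, P\right) + C\!\left(\frac{\frac{1}{4}[y_0 - (r_0 - 1)x_0] \cdot \frac{1}{4}y_1}{y_0 y_1}\alpha, P\right)
\end{aligned}$$

Figure 7: The representative input chunks, shown as the shaded rectangles, for region $\mathcal{R}_{\langle 2,1 \rangle}$ (top-left), $\mathcal{R}_{\langle 1,1 \rangle}$ (top-right), $\mathcal{R}_{\langle 2,2 \rangle}$ (bottom-left) and $\mathcal{R}_{\langle 1,2 \rangle}$ (bottom-right), with $r_0 = 3$ and $r_1 = 1$.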

For the d-dimensional case, one can apply the analysis for the 2-dimensional case to each of the $d$ dimensions separately, and combine the results from the $d$ dimensions into the final result as follows. All representative input chunks of an output tile, when projected onto dimension $j$, become two segments: one corresponds to representative input chunks in regions $\mathcal{R}_{\langle \ldots, r_j, \ldots \rangle}$, and the other corresponds to representative input chunks in regions $\mathcal{R}_{\langle \ldots, r_j + 1, \ldots \rangle}$. We refer to these two segments as $S_{r_j}$ and $S_{r_j + 1}$. Depending on the value of $r_j$, each of the two segments intersects one or more output tiles along dimension $j$, and for each intersection, the length of the overlap between a segment and an output tile determines in part how many messages DA generates for a representative input chunk when processing that output tile. As one will see, for each segment, there are at most three different lengths. Define $e_j(k, l)$, where $k \in \{r_j, r_j + 1\}$ and $l \in \{0, 1, 2\}$, as the three possible lengths of the overlap between $S_k$ and the output tiles $S_k$ intersects, and let $c_j(k, l)$ be the number of output tiles that $S_k$ intersects at the length of $e_j(k, l)$. We now look at two different cases, $r_j = 1$ and $r_j \ge 2$, and for each case, compute $c_j(k, l)$ and $e_j(k, l)$.

Case 1: $r_j = 1$ (i.e., $y_j \le x_j$). Figure 8 shows the two segments obtained by projecting the MBRs of the representative input chunks onto dimension $j$ when the extents of input chunks are smaller than that of an output tile.

• Segment $S_{r_j}$ is entirely contained within one output tile, and hence the length of the overlapping segment between $S_{r_j}$ and the output tile is $y_j$. Therefore, we have

$$c_j(r_j, l) = \begin{cases} 1 & \text{if } l = 0 \\ 0 & \text{otherwise} \end{cases} \qquad\qquad e_j(r_j, l) = \begin{cases} y_j & \text{if } l = 0 \\ 0 & \text{otherwise} \end{cases}$$

• Segment $S_{r_j + 1}$ intersects one output tile at length $\frac{3}{4}y_j$ and another at length $\frac{1}{4}y_j$. Therefore, we have

$$c_j(r_j + 1, l) = \begin{cases} 1 & \text{if } l = 0, 1 \\ 0 & \text{otherwise} \end{cases} \qquad\qquad e_j(r_j + 1, l) = \begin{cases} \frac{3}{4}y_j & \text{if } l = 0 \\ \frac{1}{4}y_j & \text{if } l = 1 \\ 0 & \text{otherwise} \end{cases}$$

Figure 8: Two segments, $S_{r_j}$ on top and $S_{r_j + 1}$ at bottom, obtained by projecting the MBRs of all representative input chunks onto dimension $j$, when the input chunk extents are smaller than that of an output tile. The solid line in the middle represents the extents of two output tiles projected onto dimension $j$. Segment $S_{r_j}$ is entirely contained within one output tile, while segment $S_{r_j + 1}$ intersects two output tiles.
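Case 2: $r_j \ge 2$ (i.e., $y_j > x_j$). Figure 9 shows the two segments obtained by projecting the MBRs of the representative input chunks onto dimension $j$ when the extents of input chunks are larger than that of an output tile.

• Segment $S_{r_j}$ intersects one tile at length $\frac{y_j - (r_j - 4)x_j}{4}$, $r_j - 2$ output tiles at length $x_j$, and one output tile at length $\frac{3y_j - (3r_j - 4)x_j}{4}$. Therefore, we have

$$c_j(r_j, l) = \begin{cases} 1 & \text{if } l = 0, 2 \\ r_j - 2 & \text{if } l = 1 \\ 0 & \text{otherwise} \end{cases} \qquad\qquad e_j(r_j, l) = \begin{cases} \frac{y_j - (r_j - 4)x_j}{4} & \text{if } l = 0 \\ x_j & \text{if } l = 1 \\ \frac{3y_j - (3r_j - 4)x_j}{4} & \text{if } l = 2 \\ 0 & \text{otherwise} \end{cases}$$

• Segment $S_{r_j + 1}$ intersects one tile at length $\frac{3}{4}[y_j - (r_j - 1)x_j]$, $r_j - 1$ output tiles at length $x_j$, and one output tile at length $\frac{1}{4}[y_j - (r_j - 1)x_j]$. Therefore, we have

$$c_j(r_j + 1, l) = \begin{cases} 1 & \text{if } l = 0, 2 \\ r_j - 1 & \text{if } l = 1 \\ 0 & \text{otherwise} \end{cases} \qquad\qquad e_j(r_j + 1, l) = \begin{cases} \frac{3}{4}[y_j - (r_j - 1)x_j] & \text{if } l = 0 \\ x_j & \text{if } l = 1 \\ \frac{1}{4}[y_j - (r_j - 1)x_j] & \text{if } l = 2 \\ 0 & \text{otherwise} \end{cases}$$

Figure 9: Two segments, $S_{r_j}$ on top and $S_{r_j + 1}$ at bottom, obtained by projecting the MBRs of all representative input chunks onto dimension $j$, when the input chunk extent is larger than that of an output tile. The solid line in the middle represents the extents of $r_j + 1$ output tiles projected onto dimension $j$. Segment $S_{r_j}$ intersects $r_j$ output tiles, while segment $S_{r_j + 1}$ intersects $r_j + 1$ output tiles.

| Case | $c_j$, $e_j$ | $l = 0$ | $l = 1$ | $l = 2$ |
|---|---|---|---|---|
| $r_j = 1$ | $c_j(r_j, l)$ | 1 | 0 | 0 |
| | $e_j(r_j, l)$ | $y_j$ | 0 | 0 |
| | $c_j(r_j + 1, l)$ | 1 | 1 | 0 |
| | $e_j(r_j + 1, l)$ | $\frac{3}{4}y_j$ | $\frac{1}{4}y_j$ | 0 |
| $r_j \ge 2$ | $c_j(r_j, l)$ | 1 | $r_j - 2$ | 1 |
| | $e_j(r_j, l)$ | $\frac{y_j - (r_j - 4)x_j}{4}$ | $x_j$ | $\frac{3y_j - (3r_j - 4)x_j}{4}$ |
| | $c_j(r_j + 1, l)$ | 1 | $r_j - 1$ | 1 |
| | $e_j(r_j + 1, l)$ | $\frac{3}{4}[y_j - (r_j - 1)x_j]$ | $x_j$ | $\frac{1}{4}[y_j - (r_j - 1)x_j]$ |

Table 2: Summary of the values for $c_j(k, l)$ and $e_j(k, l)$ where $k = r_j, r_j + 1$ and $l = 0, 1, 2$.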
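The entries of Table 2 can be sanity-checked in code: for either segment, the overlap lengths weighted by their tile counts must add up to the chunk extent $y_j$, and the tile counts must add up to $r_j$ or $r_j + 1$. The sketch below is a hypothetical helper written for this check, not ADR code; the function name and sample extents are assumptions.

```python
from math import ceil

def overlaps(xj, yj, plus_one):
    """Return [(c_j(k, l), e_j(k, l)) for l = 0, 1, 2].

    k is r_j when plus_one is False and r_j + 1 when plus_one is True.
    """
    r = ceil(yj / xj)
    if r == 1:
        if not plus_one:
            return [(1, yj), (0, 0.0), (0, 0.0)]
        return [(1, 0.75 * yj), (1, 0.25 * yj), (0, 0.0)]
    if not plus_one:
        first = (yj - (r - 4) * xj) / 4
        last = (3 * yj - (3 * r - 4) * xj) / 4
        return [(1, first), (r - 2, xj), (1, last)]
    remainder = yj - (r - 1) * xj
    return [(1, 0.75 * remainder), (r - 1, xj), (1, 0.25 * remainder)]

def covered_length(segments):
    """Total length covered: sum of count * overlap length."""
    return sum(c * e for c, e in segments)

# Hypothetical dimension with tile extent 4 and chunk extent 9, so r_j = 3.
s_r = overlaps(4.0, 9.0, plus_one=False)
s_r1 = overlaps(4.0, 9.0, plus_one=True)
```

Here $S_{r_j}$ overlaps three tiles at lengths 3.25, 4 and 1.75, and $S_{r_j + 1}$ overlaps four tiles at lengths 0.75, 4, 4 and 0.25; both total 9.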


Table 2 summarizes the values of $c_j(k, l)$ and $e_j(k, l)$. Define $\Theta_d$ as the set of all possible d-dimensional vectors where the domain of each element of these vectors is $\{0, 1, 2\}$. That is,

$$\Theta_d = \{\langle n_0, n_1, \ldots, n_{d-1} \rangle \mid n_j \in \{0, 1, 2\} \text{ for } j = 0, 1, \ldots, d-1\}$$

For given $r_0, r_1, \ldots, r_{d-1}$, $I_{\langle v_0, v_1, \ldots, v_{d-1} \rangle}$ can be computed as follows.

$$I_{\langle v_0, v_1, \ldots, v_{d-1} \rangle} = \sum_{\langle \theta_0, \theta_1, \ldots, \theta_{d-1} \rangle \in \Theta_d} \left[\prod_{j=0}^{d-1} c_j(v_j, \theta_j)\right] C\!\left(\alpha \prod_{j=0}^{d-1} \frac{e_j(v_j, \theta_j)}{y_j},\; P\right)$$

where $v_j \in \{r_j, r_j + 1\}$ for $j = 0, 1, 2, \ldots, d-1$. Now that we can compute $I_v$ for each of the $2^d$ regions of an output tile, the expected number of messages for an input chunk, $m$, can be computed as follows.

$$m = \sum_{\mathcal{R}_v \in \Psi} \frac{\mathrm{vol}(\mathcal{R}_v)}{x_0 x_1 \cdots x_{d-1}}\, I_v$$

where $\mathrm{vol}(\mathcal{R}_v)$ is computed by Equation (2) in Section 3.3. With $A_{da}$ input chunks per output tile, $I_{msg}$, the expected number of input chunk messages for a processor per tile, can be computed as follows.

$$I_{msg} = \frac{A_{da}}{P}\, m = \frac{A_{da}}{P} \sum_{\mathcal{R}_v \in \Psi} \frac{\mathrm{vol}(\mathcal{R}_v)}{x_0 x_1 \cdots x_{d-1}}\, I_v$$

Assuming perfect declustering, each processor reads $B_{da}/P$ output chunks during the initialization phase, and $A_{da}/P$ input chunks during the local reduction phase. Each processor sends $I_{msg}$ messages for input chunks during the local reduction phase. Similar to FRA, since each output chunk is mapped to by $\beta$ input chunks, $B_{da}\beta$ computation operations are carried out in total for an output tile of $B_{da}$ output chunks. Assuming perfect declustering of the input chunks across all processors, each processor is responsible for $B_{da}\beta / P$ computation operations per output tile.
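The complete DA message estimate combines the overlap table, the function $C$, and the region volumes. The following is a minimal sketch under stated assumptions: all function names and sample values are hypothetical, the overlap lengths are normalized by $y_j$ so that the argument of $C$ is the fraction of the $\alpha$ output chunks falling in each tile (as in the 2-dimensional formulas), and $A_{da}$ is supplied by the caller.

```python
from itertools import product
from math import ceil, prod

def remote_owners(k, P):
    """C(k, P): expected number of remote processors owning k mapped output chunks."""
    return P - 1 if k >= P else (k - 1) * (k / P) + k * (1 - k / P)

def overlaps(xj, yj, plus_one):
    """[(c_j(k, l), e_j(k, l)) for l = 0, 1, 2]; k = r_j + 1 if plus_one else r_j."""
    r = ceil(yj / xj)
    if r == 1:
        return ([(1, 0.75 * yj), (1, 0.25 * yj), (0, 0.0)] if plus_one
                else [(1, yj), (0, 0.0), (0, 0.0)])
    if plus_one:
        rem = yj - (r - 1) * xj
        return [(1, 0.75 * rem), (r - 1, xj), (1, 0.25 * rem)]
    return [(1, (yj - (r - 4) * xj) / 4), (r - 2, xj),
            (1, (3 * yj - (3 * r - 4) * xj) / 4)]

def messages_for_region(x, y, v, alpha, P):
    """I_v: expected messages for the representative chunk of region v."""
    r = [ceil(yj / xj) for xj, yj in zip(x, y)]
    per_dim = [overlaps(xj, yj, vj == rj + 1) for xj, yj, rj, vj in zip(x, y, r, v)]
    total = 0.0
    for theta in product(range(3), repeat=len(x)):
        count = prod(per_dim[j][t][0] for j, t in enumerate(theta))
        if count == 0:
            continue
        fraction = prod(per_dim[j][t][1] / y[j] for j, t in enumerate(theta))
        total += count * remote_owners(alpha * fraction, P)
    return total

def messages_per_chunk(x, y, alpha, P):
    """m: region-volume-weighted average of I_v over the 2^d regions."""
    r = [ceil(yj / xj) for xj, yj in zip(x, y)]
    m = 0.0
    for v in product(*[(rj, rj + 1) for rj in r]):
        vol = prod(rj * xj - yj if vj == rj else yj - (rj - 1) * xj
                   for xj, yj, rj, vj in zip(x, y, r, v))
        m += vol / prod(x) * messages_for_region(x, y, v, alpha, P)
    return m

def input_messages_per_processor(A_da, P, m):
    """I_msg = (A_da / P) * m."""
    return A_da / P * m

# Hypothetical 2-D query: tiles 10 x 10, chunk MBRs 2 x 5, alpha = 16, P = 8.
m = messages_per_chunk([10.0, 10.0], [2.0, 5.0], 16, 8)
```

For this sample configuration the sketch gives $m = 9.3625$ messages per input chunk, which can then be scaled by $A_{da}/P$ to obtain $I_{msg}$.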


5  Cost  Model  Validation 

In this section we validate our cost models with queries for synthetic datasets and for several driving applications.

We first use synthetic datasets to evaluate the cost models under controlled scenarios. The output dataset is a 2-dimensional rectangular array. The entire output attribute space is regularly partitioned into non-overlapping rectangles, with each rectangle representing an output chunk in the output dataset. The input dataset has a 3-dimensional attribute space, and input chunks were placed in the input space randomly with a uniform distribution. The assignment of input and output chunks to the disks was done using a Hilbert curve-based declustering algorithm [5]. In these experiments the sizes of the input and output datasets were fixed. The output dataset size is set at 400MB, with 1600 output chunks. The input dataset size is set at 1.6GBytes. We varied the number and extent of input chunks to produce two sets of queries: one set with $\alpha = \beta$, and the other with $8\alpha = \beta$, where $\alpha$ is the number of output chunks that an input chunk maps to, and $\beta$ is the number of input chunks that an output chunk is mapped to. For each set of queries, $\alpha$ is set to 1.5, 4, 9, and 16. We set the computation time to 1 millisecond for processing an output chunk in the initialization, global combine, and output handling phases, and to 5 milliseconds for processing each intersecting (input, output) chunk pair in the local reduction phase. The number of ADR back-end processors is varied over 8, 16, 32, 64, and 128. For a given number of processors, a query plan is generated for each query under each of the three strategies. Actual operation counts for a given query are obtained by scanning through the query plan and selecting the maximum counts among all the processors, while estimated operation counts per processor are computed from the cost models. These actual and estimated counts are converted into I/O volume, computation time and communication volume, and compared against each other. Figures 10-17 show the actual and estimated values for I/O volume, computation time and communication volume for the two sets of synthetic queries. As is seen from the figures, the cost models are able to accurately estimate the I/O volume, computation time and communication volume for the query processing strategies for different $\alpha$ and $\beta$ values and for varying numbers of processors.

Figure 10: Actual (left) and estimated (right) I/O volume, computation time and communication volume for the synthetic query with $\alpha = \beta = 1.5$.

Figure  10-17  show  that  both  FRA  and  SRA  read  more  data  than  DA  does,  and  this  is  because  FRA  and 
SRA  use  part  of  the  system  memory  for  ghost  chunks  and  therefore  generates  more  output  tiles  than  DA 
does.  As  a  result,  input  chunks  are  retrieved  from  disks  more  times  in  FRA  and  SRA  than  they  are  in  DA. 
The  figures  also  show  that  the  computation  time  decreases  for  all  strategies  as  the  number  of  processors 
increases.  This  is  because  the  computation  operations  are  distributed  among  all  processors  and  therefore 
as  the  number  of  processors  increases,  each  processor  is  responsible  for  less  computation.  Due  to  the 
computation  overhead  for  initializing  the  ghost  chunks  during  the  initialization  phase  and  for  combining 
the  ghost  chunks  during  the  global  combine  phase,  both  FRA  and  SRA  perform  more  computation  than 
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Figure  1 1 :  Actual  (left)  and  estimated  (right)  FO  volume,  computation  time  and  communication  volume 
for  the  synthetic  query  with  a  =  fi  =  4. 
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Figure  12:  Actual  (left)  and  estimated  (right)  FO  volume,  computation  time  and  communication  volume 
for  the  synthetic  query  with  a  =  (3  =  9. 
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Figure  13:  Actual  (left)  and  estimated  (right)  FO  volume,  computation  time  and  communication  volume 
for  the  synthetic  query  with  a  =  fi  =  \6. 
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Figure  14:  Actual  (left)  and  estimated  (right)  FO  volume,  computation  time  and  communication  volume 
for  the  synthetic  query  with  a  =  1.5, /3  =  12. 
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Figure  15:  Actual  (left)  and  estimated  (right)  FO  volume,  computation  time  and  communication  volume 
for  the  synthetic  query  with  a  =  4,  /3  =  32. 
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Figure  16:  Actual  (left)  and  estimated  (right)  FO  volume,  computation  time  and  communication  volume 
for  the  synthetic  query  with  a  =  9,  fi  =  12. 
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Figure  17:  Actual  (left)  and  estimated  (right)  FO  volume,  computation  time  and  communication  volume 
for  the  synthetic  query  with  o  =  16,  /3  =  128. 
DA. SRA often performs less computation than FRA because SRA may replicate fewer ghost chunks and
therefore incur less computation overhead. This is confirmed by the observation that whenever SRA incurs
less computation overhead than FRA, SRA also generates less communication volume. In fact, the cost
model predicts that SRA behaves exactly the same as FRA until the point where β = P. Beyond that point,
an output chunk is mapped to by only β input chunks, and is therefore replicated by SRA on at most β
processors. FRA, on the other hand, always replicates an output chunk on all P processors, and therefore
generates more communication volume as the number of processors increases. The predicted relationship
between FRA and SRA is most clearly displayed by the estimated communication volume in Figure 15.
The actual communication volume shown in Figure 15, however, indicates that the actual behaviour of
SRA tends to depart from that of FRA even with β < P. This is because the cost model assumes perfect
declustering of the input chunks that map to an output chunk, which in practice is not achieved. As a result,
SRA replicates an output chunk on fewer than β processors, and therefore generates less communication
volume than the cost model predicts. For the same reason, the cost model for DA does not
accurately estimate the communication volume for 16 processors for the query with α = 16 and β = 128,
as seen in Figure 17. The cost model assumes perfect declustering of the output chunks that an input
chunk maps to. Thus, with α = 16, an input chunk on one processor is expected to be sent to fifteen other
processors. In practice, however, perfect declustering is not achieved, and an input chunk is sent to fewer
than fifteen processors. As a result, the actual communication volume is less than what the cost model
predicts.
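
The gap between the predicted and the observed replication can be illustrated with a small sketch. It is not ADR code: the function names are hypothetical, the first two functions restate the perfect-declustering idealization described above, and the third assumes, purely for illustration, that each of the β input chunks is placed on a processor uniformly at random.

```python
def fra_replicas(P: int) -> int:
    """FRA replicates every output chunk on all P processors."""
    return P


def sra_replicas_model(beta: int, P: int) -> int:
    """Cost-model view: with perfect declustering, the beta input chunks
    that map to an output chunk sit on min(beta, P) distinct processors."""
    return min(beta, P)


def sra_replicas_random(beta: int, P: int) -> float:
    """Illustrative assumption (not from the paper): if each of the beta
    input chunks lands on a uniformly random processor, this is the
    expected number of distinct processors that hold at least one."""
    return P * (1.0 - (1.0 - 1.0 / P) ** beta)


# Compare the three views for beta = 32 over a range of processor counts.
for P in (8, 16, 32, 64, 128):
    print(P, fra_replicas(P), sra_replicas_model(32, P),
          round(sra_replicas_random(32, P), 1))
```

Under the random-placement assumption the expected number of replicas is below min(β, P) whenever β > 1, which matches the direction of the discrepancy reported above: imperfect declustering yields fewer replicas, and hence less communication, than the model predicts.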

We have also evaluated the cost models for different application scenarios, varying the number of
processors and the input dataset size. We used application emulators [11] to generate various application
scenarios for the application classes that motivated the design of ADR (see Section 1). An application
emulator provides a parameterized model of an application class; adjusting the parameter values makes it
possible to generate different application scenarios within the application class and scale applications in a
controlled way. The assignment of both input and output chunks to the disks was done using a Hilbert
curve-based declustering algorithm [5].
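
As an illustration of the general idea, the sketch below orders the chunks of a two-dimensional chunk grid along a Hilbert curve and assigns them to disks round-robin. This is a minimal sketch of one common Hilbert curve-based scheme, assuming a square grid whose side is a power of two; the algorithm of [5] as used in ADR may differ in detail, and the names `hilbert_index` and `decluster` are hypothetical.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve that fills an
    n x n grid, where n is a power of two."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip so the sub-quadrant is in canonical orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def decluster(n, num_disks):
    """Assign each chunk of an n x n chunk grid to a disk, round-robin
    in Hilbert order, so that spatially close chunks tend to be spread
    over different disks."""
    cells = sorted(((x, y) for x in range(n) for y in range(n)),
                   key=lambda c: hilbert_index(n, c[0], c[1]))
    return {cell: rank % num_disks for rank, cell in enumerate(cells)}


placement = decluster(4, 4)  # 16 chunks spread over 4 disks
```

Because neighbouring positions on the curve are neighbouring cells in the grid, a range query that touches a compact region tends to hit chunks on many different disks, although, as the experiments show, the resulting distribution is not always perfectly balanced.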

Table 3 summarizes dataset sizes and application characteristics for three application classes: satellite
data processing (SAT) [4], analysis of microscopy data with the Virtual Microscope (VM) [1], and water
contamination studies (WCS) [8]. The output dataset size was fixed for each application. The last
column shows the computation time per chunk for the different phases of query execution (see Section 2);
I-LR-GC-OH represents the Initialization-Local Reduction-Global Combine-Output Handling phases. The
computation times shown represent the relative computation cost of the different phases within and across
the different applications. The LR value denotes the computation cost for each intersecting (input chunk,
accumulator chunk) pair. Thus, an input chunk that maps to a larger number of accumulator chunks takes
longer to process. In all of these applications the output datasets are regular arrays, hence each output
dataset is divided into regular multi-dimensional rectangular regions. The distribution of the individual data
items and the data chunks in the input dataset for SAT is irregular. This is because of the polar orbit of the
satellite [10]: the data chunks near the poles are more elongated on the surface of the earth than those near
the equator, and there are more overlapping chunks near the poles. The input datasets for WCS and VM
are regular dense arrays that are partitioned into equal-sized rectangular chunks. We selected the values
for the various parameters to represent some typical scenarios for these application classes, based on our
experience with the complete applications.

Figures 18-20 show the measured and estimated values for I/O volume, computation time and
communication volume for each application. As is seen from the figures, the cost models are able to estimate the
volumes of I/O and communication in most application scenarios. However, the cost models fail to estimate
the computation times of the strategies for the SAT and WCS applications. Our experiments show that
in these two applications there is a load imbalance in the computation assigned to the various processors.

          Input Dataset        Output Dataset                          Computation
          Num. of    Total     Num. of    Total     Average  Average   (in milliseconds)
  App.    Chunks     Size      Chunks     Size      β        α         I-LR-GC-OH

  SAT     9K         1.6GB     256        25MB      161      4.6       1-40-20-1
  WCS     7.5K       1.7GB     150        17MB      60       1.2       1-20-1-1
  VM      16K        1.5GB     256        192MB     64       1.0       1-5-1-1

Table 3: Application characteristics.
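
The average values of α and β in Table 3 can be cross-checked against the chunk counts. Since α is the average number of output chunks an input chunk maps to and β is the average number of input chunks that map to an output chunk, both (number of input chunks) × α and (number of output chunks) × β count the same set of intersecting (input chunk, output chunk) pairs. The sketch below performs that check; reading "K" as 1000 is an assumption, and the helper name is hypothetical.

```python
# (input chunks, average alpha, output chunks, average beta) from Table 3.
# "K" is read as 1000 here, which is an assumption.
apps = {
    "SAT": (9000, 4.6, 256, 161),
    "WCS": (7500, 1.2, 150, 60),
    "VM": (16000, 1.0, 256, 64),
}


def pair_count_gap(n_in, alpha, n_out, beta):
    """Relative difference between the two ways of counting
    intersecting (input chunk, output chunk) pairs."""
    from_inputs = n_in * alpha
    from_outputs = n_out * beta
    return abs(from_inputs - from_outputs) / max(from_inputs, from_outputs)


for name, row in apps.items():
    print(name, f"{pair_count_gap(*row):.3f}")
```

The two counts agree to within a few percent for all three applications, which is consistent with the rounded values reported in the table.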


There  are  two  main  reasons  for  the  load  imbalance  in  these  applications.  First,  the  distribution  of  data 
elements  in  the  output  attribute  space  is  not  uniform  for  SAT.  Second,  the  Hilbert  curve-based  declustering 
algorithms  do  not  achieve  optimal  distribution  of  the  input  and  output  chunks  across  the  processors,  causing 
load  imbalance  in  some  cases.  Since  the  cost  models  assume  perfect  declustering  and  a  uniform  distribution 
of  the  computations  across  the  processors,  the  models  may  fail  to  predict  the  relative  computation  times  of 
the  strategies  in  those  cases. 
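
One simple way to quantify such imbalance, offered as an illustrative sketch and not as the metric used in these experiments, is the ratio of the largest per-processor computation load to the average load. The cost models estimate the average; if processors synchronize between phases, the most heavily loaded processor tends to determine the elapsed time. The load values below are hypothetical.

```python
def imbalance_factor(per_processor_load):
    """Ratio of the maximum load to the mean load.
    1.0 means perfectly balanced; larger values mean the average
    underestimates the time spent by the slowest processor."""
    mean = sum(per_processor_load) / len(per_processor_load)
    return max(per_processor_load) / mean


balanced = [10.0, 10.0, 10.0, 10.0]   # hypothetical seconds per processor
skewed = [5.0, 5.0, 5.0, 25.0]        # same total work, one hot processor

print(imbalance_factor(balanced), imbalance_factor(skewed))
```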


6  Conclusion 

We have presented cost models to estimate the average operation counts for three query processing strategies,
FRA, SRA and DA, in ADR. These cost models allow us to estimate the average I/O volume, computation
time and communication volume for each processor. We have also validated our cost models with queries
for synthetic datasets and queries for three driving applications. Our experiments show that our cost models
are able to accurately estimate the I/O volume, computation time and communication volume for each of
the three query processing strategies.

However, the ultimate goal of our research is to predict the relative query execution time for each query
processing strategy so that, for a given query and machine configuration, the ADR planning service is able
to choose the query processing strategy that would process the query in the least amount of time. The
cost models presented in this paper are able to accurately predict the I/O volume, computation time and
communication volume for the three query processing strategies, but stop short of estimating the actual
execution time. One solution is to use the I/O and communication bandwidths of the machine that ADR runs
on to turn I/O and communication volumes into I/O and communication times, and the average computation
time for the data processing functions specified by the given query to turn computation operation counts
into computation time. The sum can then be used as the estimated execution time. I/O and communication
bandwidths can be measured by running a set of sample queries and using the average bandwidths observed
by those queries. Computation time of the data processing functions can be obtained from statistics gathered
either during ADR startup time or from earlier queries that invoke the same query processing functions.
We are currently working on a machine model that, when combined with the cost models presented in this
paper, can be used to estimate the relative query execution times of the three strategies.
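
The conversion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the machine model under development: it simply sums the three time components without modelling any overlap between I/O, communication and computation, and every name and number in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CostEstimate:
    """Per-processor averages produced by a cost model (hypothetical container)."""
    io_bytes: float      # I/O volume
    comm_bytes: float    # communication volume
    comp_ops: float      # computation operation count


def estimated_time(est, io_bw, comm_bw, sec_per_op):
    """Turn volumes into times with measured bandwidths (bytes/second)
    and an average time per operation, then add the components."""
    return (est.io_bytes / io_bw
            + est.comm_bytes / comm_bw
            + est.comp_ops * sec_per_op)


def pick_strategy(estimates, io_bw, comm_bw, sec_per_op):
    """Return the strategy name with the smallest estimated time."""
    return min(estimates,
               key=lambda s: estimated_time(estimates[s], io_bw, comm_bw, sec_per_op))


# Hypothetical cost-model outputs for one query on one machine configuration.
estimates = {
    "FRA": CostEstimate(io_bytes=100e6, comm_bytes=400e6, comp_ops=1000),
    "SRA": CostEstimate(io_bytes=100e6, comm_bytes=200e6, comp_ops=900),
    "DA": CostEstimate(io_bytes=100e6, comm_bytes=50e6, comp_ops=2000),
}
best = pick_strategy(estimates, io_bw=50e6, comm_bw=100e6, sec_per_op=0.01)
print(best)
```

With these made-up inputs the lower communication volume of DA is outweighed by its higher computation count, so the choice depends on the machine parameters as much as on the operation counts.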

Note that the cost models that we have presented in this paper assume that the input data chunks
are uniformly distributed in the output attribute space, and that the output chunks form a regular
multi-dimensional grid. In scenarios where these assumptions do not hold, such as the SAT application described
in Section 5, an inspector code can be used to generate a partial query plan for each of the three strategies
and estimate the relative query execution times of the strategies based on information gathered from the
partial query plans. The full query plan for the strategy that is predicted to have the smallest execution time is
then generated, and the query can be processed as planned. We are currently evaluating the effectiveness of
such an inspector.

Figure 18: Measured (left) and estimated (right) I/O volume, computation time and communication volume
for SAT application.

Figure 19: Measured (left) and estimated (right) I/O volume, computation time and communication volume
for WCS application.

Figure 20: Measured (left) and estimated (right) I/O volume, computation time and communication volume
for VM application.